
How to Build a Scalable Tiered Archive & Query System for MySQL Data

This article presents a comprehensive design for a layered storage and unified scheduling platform that archives MySQL historical data, reduces storage costs, ensures high‑performance queries, and enables efficient data analysis through tiered hot, warm, and cold storage using big‑data technologies.

Architect-Kip

Overview

The growing volume of business data creates three main challenges: online MySQL databases face storage pressure and performance degradation, historical data storage becomes costly and non‑queryable, and the five data lifecycle stages (extraction, archiving, backup, query, analysis) lack unified management, leading to resource waste and limited data asset utilization.

The goal is to build a layered storage and unified scheduling platform for historical data archiving and querying, achieving cost optimization through hot‑warm‑cold tiering, performance guarantees for online services and analytics, and enhanced data value extraction for business users.

Scope: the solution targets MySQL (MariaDB) business data archiving, storage, and query services.

Architecture Design

The architecture consists of four layers: data collection & synchronization, task scheduling & computation, storage, and access & service.

Data collection & synchronization layer: periodic batch archiving during low‑traffic periods and real‑time binlog subscription for streaming use cases.

Task scheduling & computation layer: Spark jobs for periodic archiving, data cleaning, and computation; real‑time processing via Flink CDC.

Storage layer: hot, warm, and cold data stored on different media.

Access & service layer: gateway and business services, including online services, historical query services, and asynchronous extraction services.

Collaboration workflow:

Historical data query: data warehouse scheduled tasks extract business tables to Hive, then optionally sync warm data to Doris; business systems query via Doris.

Cold data asynchronous extraction: business system submits extraction request, service triggers Spark to package data from object storage, and the system retrieves the compressed file.

Detailed Design

Data Collection & Synchronization Layer

Real‑time binlog listening

Technical components: Canal, Debezium, or Flink CDC.

Selection comparison:

Canal/Debezium – simple data sync, suitable for straightforward change capture.

Flink CDC – integrates capture and real‑time processing, supports complex ETL.

Example pipelines:

// Canal/Debezium – data capture and forwarding
MySQL Binlog → Canal Server → Kafka → custom consumer
// Flink CDC – integrated capture and processing
MySQL Binlog → Flink CDC Source → Flink SQL → target store

Recommendation: use Flink CDC for complex real‑time flows; otherwise, Debezium + Kafka + Flink offers a lightweight solution.
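For the lightweight Debezium + Kafka path, the consumer's job is mainly to inspect each change envelope and route it. The sketch below is a minimal, hypothetical consumer step: the envelope is a simplified Debezium-style payload (real events carry schema blocks and more source metadata), and the routing strings are placeholders for whatever the downstream archive pipeline actually does.

```python
import json

def route_change_event(raw):
    """Route a simplified Debezium-style change envelope by operation type.
    Debezium uses op codes "c" (create), "u" (update), "d" (delete)."""
    event = json.loads(raw)
    payload = event.get("payload", event)  # envelope may or may not be wrapped
    table = payload["source"]["table"]
    op = payload.get("op")
    if op == "c":
        return "insert into archive stream for %s" % table
    if op == "u":
        return "update archive stream for %s" % table
    if op == "d":
        return "tombstone for %s" % table
    return "snapshot/other event for %s" % table

# Simplified sample event, not a verbatim Debezium message.
sample = json.dumps({
    "payload": {
        "op": "c",
        "source": {"table": "orders"},
        "after": {"id": 1, "amount": 99.5},
    }
})
print(route_change_event(sample))
```

In a real deployment this function would sit inside a Kafka consumer loop; the point is that simple change capture needs only this kind of dispatch, whereas anything stateful (joins, windowed aggregation) is where Flink CDC earns its keep.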

Periodic Archiving (Extraction)

Technical components: Apache Spark + Presto, DataX, Sqoop.

Workflow comparison:

Spark + Presto: MySQL → Spark + Presto direct query → results.

DataX/Sqoop: MySQL → DataX/Sqoop → HDFS → Hive SQL → Spark → results.

Recommendation: choose DataX/Sqoop only for simple table‑to‑table sync without transformation; otherwise, Spark + Presto provides stronger capabilities and better performance.
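When the DataX path is chosen, each archiving run boils down to generating a reader/writer job config. The sketch below assembles a trimmed `mysqlreader` → `hdfswriter` job; connection values are placeholders, and a production job needs additional writer fields (such as `defaultFS` and column type definitions) that are omitted here for brevity.

```python
import json

def build_datax_job(table, columns, where, hdfs_path):
    """Assemble a minimal DataX mysqlreader -> hdfswriter job config.
    Credentials and hosts are placeholders, not a real environment."""
    return {
        "job": {
            "setting": {"speed": {"channel": 4}},
            "content": [{
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "username": "${MYSQL_USER}",
                        "password": "${MYSQL_PASS}",
                        "column": columns,
                        "where": where,  # archive predicate, e.g. a date cutoff
                        "connection": [{
                            "table": [table],
                            "jdbcUrl": ["jdbc:mysql://mysql-host:3306/biz"],
                        }],
                    },
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "path": hdfs_path,
                        "fileName": table,
                        "fileType": "orc",
                        "writeMode": "append",
                    },
                },
            }],
        }
    }

job = build_datax_job("orders", ["id", "amount", "created_at"],
                      "created_at < '2024-01-01'",
                      "/warehouse/archive/orders/dt=2024-01-01")
print(json.dumps(job)[:60])
```

Generating the config programmatically (rather than hand-editing JSON per table) keeps the archive predicate and target partition path consistent across the hundreds of tables a scheduler may drive.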

Task Scheduling & Computation Layer

Components:

Scheduler: DolphinScheduler or Apache Airflow.

Compute engine: Apache Spark or Apache Flink.

Scheduler comparison:

DolphinScheduler – visual UI with drag‑and‑drop workflow design; low barrier to entry, with strong Chinese‑language documentation and community support.

Airflow – Python‑based DAG definition, highly flexible, large community.

Compute engine comparison:

Spark – batch processing champion, mature ecosystem, fast in‑memory computation, ideal for offline Hive writes.

Flink – streaming champion, supports exactly‑once semantics, suitable for real‑time data inflow.

Typical workflows include asynchronous extraction (task submission → scheduler triggers → compute engine generates compressed files) and periodic archiving (trigger archiving task → read MySQL → write to HDFS/Hive → optionally sync to Doris).
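The periodic archiving workflow is at heart a small dependency graph that the scheduler (DolphinScheduler or Airflow) resolves. As an illustration under assumed, hypothetical task names, the ordering constraint can be expressed with a plain topological sort; a real DAG definition would live in the scheduler's own format.

```python
from graphlib import TopologicalSorter  # Python 3.9+ standard library

# Each task maps to the set of tasks it depends on. Task names are
# illustrative, mirroring the workflow described above.
tasks = {
    "read_mysql": set(),
    "write_hive": {"read_mysql"},
    "sync_doris": {"write_hive"},          # optional warm-tier sync
    "verify_counts": {"write_hive"},       # integrity check before deletion
    "delete_mysql_batch": {"verify_counts"},
}

order = list(TopologicalSorter(tasks).static_order())
print(order)
```

The key property the scheduler must enforce is visible here: deletion from MySQL can never be ordered before the verification step, which in turn requires the Hive write to have completed.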

Storage Layer

Warm data: Apache Doris (or StarRocks) – high‑performance MPP analytical database supporting low‑latency point queries and complex ad‑hoc queries.

Cold/near‑line data: Apache Hive – batch‑oriented SQL engine on Hadoop, low storage cost, suitable for data older than one month.

Archive storage: Object storage (AWS S3, Alibaba OSS, Tencent COS) – virtually unlimited scalability, high durability, low cost for infrequently accessed historical data.
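The tiering policy itself reduces to routing records by age. A minimal sketch, with illustrative thresholds (the "older than one month" boundary comes from the design above; the 7-day hot window and 1-year archive cutoff are assumptions, not part of the original design):

```python
from datetime import date
from enum import Enum

class Tier(Enum):
    HOT = "mysql"      # online OLTP tables
    WARM = "doris"     # recent history, interactive analytics
    COLD = "hive"      # near-line batch queries
    ARCHIVE = "s3"     # object storage, asynchronous extraction only

def tier_for(record_date, today, hot_days=7, warm_days=30, cold_days=365):
    """Route a record to a storage tier by its age in days.
    Thresholds are illustrative defaults."""
    age = (today - record_date).days
    if age <= hot_days:
        return Tier.HOT
    if age <= warm_days:
        return Tier.WARM
    if age <= cold_days:
        return Tier.COLD
    return Tier.ARCHIVE

today = date(2024, 6, 1)
print(tier_for(date(2024, 5, 30), today))  # a 2-day-old row stays in MySQL
```

In practice the thresholds become partition-level rules in the scheduler rather than per-row checks, but the routing logic is the same.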

Access & Service Layer

Historical query service (warm) – provides low‑latency queries to business systems via direct Doris access; enforces time‑partition parameters and integrates Sentinel for rate limiting.
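Enforcing the time-partition parameters can be done with a small guard in front of the Doris query path. This is a sketch under assumptions: the 90-day cap and the function name are hypothetical, not from the original design.

```python
from datetime import date

MAX_RANGE_DAYS = 90  # illustrative cap, not part of the original design

def validate_query_window(start, end, max_days=MAX_RANGE_DAYS):
    """Reject warm-tier queries that omit the time partition or scan too
    wide a window, protecting Doris from near-full-table scans."""
    if start is None or end is None:
        raise ValueError("time-partition parameters are mandatory")
    if end < start:
        raise ValueError("end date precedes start date")
    if (end - start).days > max_days:
        raise ValueError("window exceeds %d days" % max_days)
    return True

print(validate_query_window(date(2024, 1, 1), date(2024, 2, 1)))
```

Combined with Sentinel's rate limiting, this guard bounds both the frequency and the per-query cost of warm-tier access.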

Asynchronous extraction service (cold) – handles bulk historical data extraction; business systems submit tasks, service generates result files in object storage, and returns file paths for downstream retrieval; also protected by Sentinel.
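Because cold-data extraction is asynchronous, the service mainly tracks task state from submission to the returned file path. A minimal lifecycle sketch, with hypothetical state names and an assumed object-storage path format:

```python
from enum import Enum, auto

class TaskState(Enum):
    SUBMITTED = auto()   # request accepted by the extraction service
    RUNNING = auto()     # Spark job packaging data from object storage
    PACKAGED = auto()    # compressed file ready, path returned to caller
    FAILED = auto()

# Legal state transitions; anything else is a programming error.
ALLOWED = {
    TaskState.SUBMITTED: {TaskState.RUNNING, TaskState.FAILED},
    TaskState.RUNNING: {TaskState.PACKAGED, TaskState.FAILED},
}

class ExtractionTask:
    """Lifecycle of one asynchronous cold-data extraction task."""
    def __init__(self, task_id):
        self.task_id = task_id
        self.state = TaskState.SUBMITTED
        self.result_path = None

    def advance(self, new_state, result_path=None):
        if new_state not in ALLOWED.get(self.state, set()):
            raise ValueError("illegal transition %s -> %s" % (self.state, new_state))
        self.state = new_state
        if new_state is TaskState.PACKAGED:
            self.result_path = result_path  # e.g. an object-storage key

task = ExtractionTask("t-001")
task.advance(TaskState.RUNNING)
task.advance(TaskState.PACKAGED, "s3://archive-bucket/exports/t-001.tar.gz")
print(task.state.name, task.result_path)
```

Persisting these states (in MySQL or Redis) is what lets the business system poll for completion instead of blocking on a long-running Spark job.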

Non‑Functional Design

Performance

Doris outperforms TiDB for range scans, join queries, and analytical workloads, making it a better fit for historical data analysis.

Reliability

Data integrity checks before deletion: verify migration completeness, compare row counts or hash (MD5), sample queries, and automate verification scripts via the scheduler.
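A hash comparison that is robust to row ordering can be built by hashing each row and combining the digests with XOR; the fingerprints of the source table and the migrated copy then match regardless of read order. A minimal sketch (note the caveat that XOR lets duplicate rows cancel out, so it should be paired with the row-count check):

```python
import hashlib

def table_fingerprint(rows):
    """Order-insensitive fingerprint: MD5 each row, XOR the digests.
    Pair with a row count, since duplicate rows cancel under XOR."""
    acc = 0
    for row in rows:
        digest = hashlib.md5(repr(tuple(row)).encode("utf-8")).digest()
        acc ^= int.from_bytes(digest, "big")
    return acc

source_rows = [(1, "a"), (2, "b")]
migrated_rows = [(2, "b"), (1, "a")]  # same data, different read order
print(table_fingerprint(source_rows) == table_fingerprint(migrated_rows))
```

Wrapped in a script, this comparison becomes one of the verification steps the scheduler runs automatically before any deletion task is allowed to start.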

Secure deletion: batch‑wise deletion, strict permission isolation, approval workflow, detailed audit logs, and real‑time monitoring (e.g., Prometheus) with alerts.
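Batch-wise deletion keeps each transaction small so locks and undo pressure stay bounded. The sketch below demonstrates the loop shape against SQLite (standing in for MySQL; table name, batch size, and the rowid-subquery idiom are illustrative, and MySQL would typically use `DELETE ... LIMIT` with a primary-key range instead):

```python
import sqlite3

def delete_in_batches(conn, cutoff, batch_size=2):
    """Delete already-archived rows in small committed batches."""
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM orders WHERE rowid IN "
            "(SELECT rowid FROM orders WHERE created < ? LIMIT ?)",
            (cutoff, batch_size))
        conn.commit()  # commit per batch to keep transactions short
        if cur.rowcount == 0:
            break
        total += cur.rowcount
    return total

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, created TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(i, "2023-0%d-01" % (1 + i % 3)) for i in range(5)])
conn.commit()
deleted = delete_in_batches(conn, "2023-03-01")
print(deleted)
```

In production the same loop would run as a scheduled task behind the approval workflow, emitting a metric per batch so Prometheus can alert on abnormal deletion rates.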

Scalability

Both TiDB and Doris scale well, but Doris expands faster with automatic balancing and seconds‑level detection, making it more suitable for read‑only historical data.

Cost

Lifecycle management moves most historical data to low‑cost Hive and object storage, keeping only hot/warm data in Doris, dramatically reducing storage expenses compared to storing everything in TiDB.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Flink, Hive, Spark, data archiving, Doris, storage tiering