How to Build a Scalable Tiered Archive & Query System for MySQL Data
This article presents a design for a layered storage and unified scheduling platform that archives MySQL historical data into tiered hot, warm, and cold storage built on big‑data technologies. The goal is to reduce storage costs while keeping queries fast and enabling efficient data analysis.
Overview
The growing volume of business data creates three main challenges: online MySQL databases face storage pressure and performance degradation, historical data storage becomes costly and non‑queryable, and the five data lifecycle stages (extraction, archiving, backup, query, analysis) lack unified management, leading to resource waste and limited data asset utilization.
The goal is to build a layered storage and unified scheduling platform for historical data archiving and querying, achieving cost optimization through hot‑warm‑cold tiering, performance guarantees for online services and analytics, and enhanced data value extraction for business users.
Scope: the solution targets MySQL/MariaDB business data archiving, storage, and query services.
Architecture Design
The architecture consists of four layers: data collection & synchronization, task scheduling & computation, storage, and access & service.
Data collection & synchronization layer: periodic batch archiving during low‑traffic periods and real‑time binlog subscription for streaming use cases.
Task scheduling & computation layer: Spark jobs for periodic archiving, data cleaning, and computation; real‑time processing via Flink CDC.
Storage layer: hot, warm, and cold data stored on different media.
Access & service layer: gateway and business services, including online services, historical query services, and asynchronous extraction services.
Collaboration workflow:
Historical data query: data warehouse scheduled tasks extract business tables to Hive, then optionally sync warm data to Doris; business systems query via Doris.
Cold data asynchronous extraction: business system submits extraction request, service triggers Spark to package data from object storage, and the system retrieves the compressed file.
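The cold‑data extraction flow above can be sketched as a minimal task lifecycle. The `ExtractionTask` fields, state names, and result‑path layout below are illustrative assumptions, not the article's actual service API; the real packaging step would be a Spark job reading from object storage.

```python
import uuid
from dataclasses import dataclass, field
from enum import Enum

class TaskState(Enum):
    SUBMITTED = "SUBMITTED"
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"

@dataclass
class ExtractionTask:
    table: str
    date_from: str
    date_to: str
    task_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    state: TaskState = TaskState.SUBMITTED
    result_path: str = ""

def run_extraction(task: ExtractionTask) -> ExtractionTask:
    """Stand-in for the scheduler triggering a Spark packaging job."""
    task.state = TaskState.RUNNING
    # In the real system, a Spark job reads the requested cold partitions
    # from object storage and writes a compressed archive; here we only
    # build the path the business system would later download from.
    task.result_path = (
        f"s3://archive-results/{task.table}/"
        f"{task.date_from}_{task.date_to}/{task.task_id}.tar.gz"
    )
    task.state = TaskState.SUCCEEDED
    return task
```

The business system polls the task state and, once it reaches `SUCCEEDED`, fetches the compressed file from `result_path`.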
Detailed Design
Data Collection & Synchronization Layer
Real‑time binlog listening
Technical components: Canal, Debezium, or Flink CDC.
Selection comparison:
Canal/Debezium – simple data sync, suitable for straightforward change capture.
Flink CDC – integrates capture and real‑time processing, supports complex ETL.
Example pipelines:
// Canal/Debezium – data capture and forwarding
MySQL Binlog → Canal Server → Kafka → custom consumer

// Flink CDC – integrated capture and processing
MySQL Binlog → Flink CDC Source → Flink SQL → target store

Recommendation: use Flink CDC for complex real‑time flows; otherwise, Debezium + Kafka + Flink offers a lightweight solution.
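In the Canal/Debezium pipeline, the "custom consumer" mostly amounts to interpreting change events. The sketch below parses a Debezium‑style JSON envelope (the `op`, `before`, `after`, and `ts_ms` fields follow Debezium's documented event format; the flattening logic is our own illustration):

```python
import json

# Debezium encodes the operation type in payload.op:
# "c" = create, "u" = update, "d" = delete, "r" = snapshot read
OP_NAMES = {"c": "insert", "u": "update", "d": "delete", "r": "snapshot"}

def parse_change_event(raw: str) -> dict:
    """Flatten a Debezium-style envelope into a record for downstream sinks."""
    payload = json.loads(raw)["payload"]
    op = OP_NAMES.get(payload["op"], "unknown")
    # For deletes the new row image is absent; fall back to the old image.
    row = payload["after"] if payload["after"] is not None else payload["before"]
    return {"op": op, "row": row, "ts_ms": payload["ts_ms"]}
```

A Kafka consumer would call `parse_change_event` on each message and route the result to the target store.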
Periodic Archiving (Extraction)
Technical components: Apache Spark + Presto, DataX, Sqoop.
Workflow comparison:
Spark + Presto: MySQL → Spark + Presto direct query → results.
DataX/Sqoop: MySQL → DataX/Sqoop → HDFS → Hive SQL → Spark → results.
Recommendation: choose DataX/Sqoop only for simple table‑to‑table sync without transformation; otherwise, Spark + Presto provides stronger capabilities and better performance.
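Whichever extraction tool is chosen, the archiving job first has to decide which partitions have aged out of the warm tier. A small helper like the following can drive that decision (the one‑month warm retention window is an assumed value, and the `YYYY-MM` partition naming is illustrative):

```python
from datetime import date

def months_to_archive(today: date, partitions: list,
                      warm_months: int = 1) -> list:
    """Return 'YYYY-MM' partitions older than the warm retention window."""
    # Index months since year 0, then step back by the retention window
    # to find the first month that must stay warm.
    total = today.year * 12 + (today.month - 1) - warm_months
    cutoff = f"{total // 12:04d}-{total % 12 + 1:02d}"
    return sorted(p for p in partitions if p < cutoff)
```

The scheduler would feed the returned partitions to the Spark (or DataX/Sqoop) extraction job, one batch per partition.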
Task Scheduling & Computation Layer
Components:
Scheduler: DolphinScheduler or Apache Airflow.
Compute engine: Apache Spark or Apache Flink.
Scheduler comparison:
DolphinScheduler – visual drag‑and‑drop UI, low barrier to entry, widely adopted and well documented in the Chinese community.
Airflow – Python‑based DAG definition, highly flexible, large community.
Compute engine comparison:
Spark – the workhorse for batch processing: mature ecosystem, fast in‑memory computation, well suited to offline Hive writes.
Flink – the leading stream processor: exactly‑once semantics, well suited to real‑time data ingestion.
Typical workflows include asynchronous extraction (task submission → scheduler triggers → compute engine generates compressed files) and periodic archiving (trigger archiving task → read MySQL → write to HDFS/Hive → optionally sync to Doris).
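The periodic archiving workflow above is essentially a small dependency DAG that the scheduler resolves. A toy topological ordering (task names are illustrative, not DolphinScheduler or Airflow API) shows the execution order the scheduler must respect:

```python
from graphlib import TopologicalSorter

# Edges mirror the periodic archiving flow described above:
# read MySQL -> write HDFS/Hive -> (verify, sync to Doris), and only
# delete from MySQL after verification succeeds.
dag = {
    "write_hive": {"read_mysql"},
    "verify_counts": {"write_hive"},
    "sync_doris": {"write_hive"},
    "delete_mysql_batch": {"verify_counts"},
}
order = list(TopologicalSorter(dag).static_order())
```

In DolphinScheduler these edges would be drawn in the UI; in Airflow they would be `>>` dependencies in a DAG file. The key property is the same: deletion is unreachable until verification has run.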
Storage Layer
Warm data: Apache Doris (or StarRocks) – high‑performance MPP analytical database supporting low‑latency point queries and complex ad‑hoc queries.
Cold/near‑line data: Apache Hive – batch‑oriented SQL engine on Hadoop, low storage cost, suitable for data older than one month.
Archive storage: Object storage (AWS S3, Alibaba OSS, Tencent COS) – virtually unlimited scalability, high durability, low cost for infrequently accessed historical data.
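The hot/warm/cold split can be captured as a routing rule on record age. The thresholds below (about one month for the warm tier, one year before deep archive) are illustrative assumptions extrapolated from the tiering above, not fixed values from the design:

```python
def storage_tier(age_days: int) -> str:
    """Route a record to a storage tier by age; thresholds are illustrative."""
    if age_days <= 30:          # recent data stays in MySQL/Doris (hot/warm)
        return "doris"
    if age_days <= 365:         # near-line analytical data, batch queries
        return "hive"
    return "object_storage"     # deep archive, retrieved asynchronously
```

Keeping this rule in one place makes the lifecycle policy easy to audit and tune as cost or query patterns change.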
Access & Service Layer
Historical query service (warm) – provides low‑latency queries to business systems via direct Doris access; enforces time‑partition parameters and integrates Sentinel for rate limiting.
Asynchronous extraction service (cold) – handles bulk historical data extraction; business systems submit tasks, service generates result files in object storage, and returns file paths for downstream retrieval; also protected by Sentinel.
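Sentinel provides flow control for both services out of the box; the underlying idea can be illustrated with a bare‑bones token bucket (a stand‑in for a Sentinel flow rule, not Sentinel's actual API):

```python
import time

class TokenBucket:
    """Minimal token-bucket limiter, a stand-in for a Sentinel flow rule."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate            # tokens refilled per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Admit the request if a token is available, else reject it."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Rejected requests would receive a fast "rate limited" response instead of queuing up against Doris or the extraction backend.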
Non‑Functional Design
Performance
Doris outperforms TiDB for range scans, join queries, and analytical workloads, making it a better fit for historical data analysis.
Reliability
Data integrity checks before deletion: verify migration completeness by comparing row counts or MD5 hashes, run sample queries, and automate the verification scripts via the scheduler.
Secure deletion: batch‑wise deletion, strict permission isolation, approval workflow, detailed audit logs, and real‑time monitoring (e.g., Prometheus) with alerts.
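The verification step can be sketched as a pure function the scheduler runs before any delete batch is allowed. The order‑insensitive MD5 over row tuples is an illustration of the checksum idea, not a built‑in database feature:

```python
import hashlib

def table_checksum(rows: list) -> str:
    """Order-insensitive MD5 over a set of row tuples (illustrative)."""
    h = hashlib.md5()
    for row in sorted(rows):
        h.update(repr(row).encode("utf-8"))
    return h.hexdigest()

def safe_to_delete(source_rows: list, archived_rows: list) -> bool:
    """Allow deletion only when both row count and checksum match."""
    return (len(source_rows) == len(archived_rows)
            and table_checksum(source_rows) == table_checksum(archived_rows))
```

In practice the checksums would be computed on each side (MySQL and Hive) and only the digests compared, so full row sets never need to be shipped across systems.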
Scalability
Both TiDB and Doris scale well, but Doris expands faster with automatic balancing and seconds‑level detection, making it more suitable for read‑only historical data.
Cost
Lifecycle management moves most historical data to low‑cost Hive and object storage, keeping only hot/warm data in Doris, dramatically reducing storage expenses compared to storing everything in TiDB.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Architect-Kip
Daily architecture work and learning summaries. Not seeking lengthy articles—only real practical experience.
