Scaling an Airline Ticket Order Database: From Monolith to 64‑Shard Sharding
The article details how a rapidly growing airline ticket order system was re‑architected by identifying performance bottlenecks, applying vertical and horizontal sharding, optimizing cache layers, implementing dual‑write mechanisms, and planning a phased migration to achieve ten‑fold QPS growth while reducing resource usage and operational risk.
Background and Motivation
The airline ticket order platform experienced rapid growth, leading to severe performance issues: CPU usage over 50% during peaks, disk space exhaustion (>80% usage), and limited scalability that required costly hardware upgrades. To support a projected 10× increase in order volume, the team needed a more scalable database architecture.
Initial Optimizations (Phase 1)
Early efforts focused on conventional techniques:
Index tuning
Read‑write separation
Reducing high‑frequency queries
These measures provided immediate stability at low cost but could not sustain long‑term growth.
Architecture Evolution
Vertical Splitting
Orders were divided by business domain (order management, ticketing, refunds, etc.) into separate databases, improving reliability and reducing cross‑service impact of failures.
Horizontal Splitting (Hot/Cold Separation)
Completed orders were migrated to a read‑only cold database, while active orders remained in a hot database. A restore mechanism allowed selective data recovery from cold to hot when modifications were required.
New Sharding Architecture (Phase 2)
The redesigned system introduced a 64‑shard cluster deployed across 16 physical machines, replacing the previous 1‑to‑2 database layout. Key components include:
Shard Key Selection : Main order ID chosen as the shard key after evaluating stability, uniform load distribution, and minimal cross‑shard queries.
Order‑ID Index Table : Maps sub‑order IDs to main order IDs, enabling fast shard resolution without scanning all shards.
Multi‑Level Cache : Client‑side cache, Redis distributed cache, and server‑side Guava cache reduce index lookups; cache hit rates exceed 99%.
Memory‑Optimized Index Structure : Stores indices as Map<Long, short[]>, cutting memory usage by ~93% (≈3 bytes per index).
Additional optimizations include using modulo‑based bucket assignment so that sub‑order IDs share the same shard as their parent order, eliminating the need for index lookups in many cases.
Cross‑Shard Query Optimization
To handle queries that are not shard‑key based (e.g., by user ID or timestamp), the team introduced:
UID Index Table : Stores user‑ID to order‑ID mappings, allowing quick identification of relevant shards.
Mirror Database : A MySQL instance aggregates hot data from all shards for read‑heavy workloads, synchronized via Canal+QMQ.
Comparative analysis showed MySQL outperformed Elasticsearch for numeric queries, leading to the choice of MySQL for the mirror.
Dual‑Write Component
The existing DAL component was extended to support writing to both SQLServer (legacy) and MySQL (new) databases. Three dual‑write modes were defined:
Asynchronous (AC): Writes to MySQL in a background thread; failures only generate alerts.
Synchronous (SC): Writes to MySQL after successful SQLServer write; failures generate alerts.
Synchronous‑Throw (ST): Failures in MySQL abort the transaction, forcing immediate handling.
Exception handling strategies were adjusted based on which database acted as the primary source.
Fault Tolerance and Shard Isolation
Given the increased number of shards, the system added mechanisms to mitigate partial failures:
Partial Result Return : Queries can continue on healthy shards and return available data along with error information.
Shard Masking : Faulty shards can be temporarily disabled, preventing thread‑pool blockage.
Shard Re‑hashing : When a shard fails, the hash algorithm re‑assigns affected users to alternative shards, reducing impact.
Project Planning and Execution
The migration was organized into six milestones:
API‑level read‑access closure.
Development of dual‑read/dual‑write capabilities.
Data‑consistency verification.
Performance testing and fault‑simulation.
Switching reads from SQLServer to MySQL.
Stopping SQLServer writes and decommissioning legacy tables.
Each stage emphasized risk reduction, extensive task tracking, and cross‑team coordination.
Key Lessons Learned
Clear project scope and task ownership are essential for long‑running initiatives.
Minimizing exceptions and dependencies reduces hidden risks.
One‑thing‑at‑a‑time implementation prevents schema drift and costly rework.
Comprehensive monitoring and automated compensation mechanisms improve reliability.
The effort resulted in a 64‑shard, horizontally scalable order database, CPU utilization dropping from 40% to 3‑5%, storage capacity supporting >5 years of orders, and a significant reduction in overall infrastructure cost.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
