Databases 38 min read

Scaling an Airline Ticket Order Database: From Monolith to 64‑Shard Sharding

The article details how a rapidly growing airline ticket order system was re‑architected by identifying performance bottlenecks, applying vertical and horizontal sharding, optimizing cache layers, implementing dual‑write mechanisms, and planning a phased migration to achieve ten‑fold QPS growth while reducing resource usage and operational risk.

dbaplus Community
dbaplus Community
dbaplus Community
Scaling an Airline Ticket Order Database: From Monolith to 64‑Shard Sharding

Background and Motivation

The airline ticket order platform experienced rapid growth, leading to severe performance issues: CPU usage over 50% during peaks, disk space exhaustion (>80% usage), and limited scalability that required costly hardware upgrades. To support a projected 10× increase in order volume, the team needed a more scalable database architecture.

Initial Optimizations (Phase 1)

Early efforts focused on conventional techniques:

Index tuning

Read‑write separation

Reducing high‑frequency queries

These measures provided immediate stability at low cost but could not sustain long‑term growth.

Architecture Evolution

Vertical Splitting

Orders were divided by business domain (order management, ticketing, refunds, etc.) into separate databases, improving reliability and reducing cross‑service impact of failures.

Horizontal Splitting (Hot/Cold Separation)

Completed orders were migrated to a read‑only cold database, while active orders remained in a hot database. A restore mechanism allowed selective data recovery from cold to hot when modifications were required.

New Sharding Architecture (Phase 2)

The redesigned system introduced a 64‑shard cluster deployed across 16 physical machines, replacing the previous 1‑to‑2 database layout. Key components include:

Shard Key Selection : Main order ID chosen as the shard key after evaluating stability, uniform load distribution, and minimal cross‑shard queries.

Order‑ID Index Table : Maps sub‑order IDs to main order IDs, enabling fast shard resolution without scanning all shards.

Multi‑Level Cache : Client‑side cache, Redis distributed cache, and server‑side Guava cache reduce index lookups; cache hit rates exceed 99%.

Memory‑Optimized Index Structure : Stores indices as Map<Long, short[]>, cutting memory usage by ~93% (≈3 bytes per index).

Additional optimizations include using modulo‑based bucket assignment so that sub‑order IDs share the same shard as their parent order, eliminating the need for index lookups in many cases.

Cross‑Shard Query Optimization

To handle queries that are not shard‑key based (e.g., by user ID or timestamp), the team introduced:

UID Index Table : Stores user‑ID to order‑ID mappings, allowing quick identification of relevant shards.

Mirror Database : A MySQL instance aggregates hot data from all shards for read‑heavy workloads, synchronized via Canal+QMQ.

Comparative analysis showed MySQL outperformed Elasticsearch for numeric queries, leading to the choice of MySQL for the mirror.

Dual‑Write Component

The existing DAL component was extended to support writing to both SQLServer (legacy) and MySQL (new) databases. Three dual‑write modes were defined:

Asynchronous (AC): Writes to MySQL in a background thread; failures only generate alerts.

Synchronous (SC): Writes to MySQL after successful SQLServer write; failures generate alerts.

Synchronous‑Throw (ST): Failures in MySQL abort the transaction, forcing immediate handling.

Exception handling strategies were adjusted based on which database acted as the primary source.

Fault Tolerance and Shard Isolation

Given the increased number of shards, the system added mechanisms to mitigate partial failures:

Partial Result Return : Queries can continue on healthy shards and return available data along with error information.

Shard Masking : Faulty shards can be temporarily disabled, preventing thread‑pool blockage.

Shard Re‑hashing : When a shard fails, the hash algorithm re‑assigns affected users to alternative shards, reducing impact.

Project Planning and Execution

The migration was organized into six milestones:

API‑level read‑access closure.

Development of dual‑read/dual‑write capabilities.

Data‑consistency verification.

Performance testing and fault‑simulation.

Switching reads from SQLServer to MySQL.

Stopping SQLServer writes and decommissioning legacy tables.

Each stage emphasized risk reduction, extensive task tracking, and cross‑team coordination.

Key Lessons Learned

Clear project scope and task ownership are essential for long‑running initiatives.

Minimizing exceptions and dependencies reduces hidden risks.

One‑thing‑at‑a‑time implementation prevents schema drift and costly rework.

Comprehensive monitoring and automated compensation mechanisms improve reliability.

The effort resulted in a 64‑shard, horizontally scalable order database, CPU utilization dropping from 40% to 3‑5%, storage capacity supporting >5 years of orders, and a significant reduction in overall infrastructure cost.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Distributed SystemsDual Writemysqldatabase shardingperformance engineeringcache optimizationhorizontal scalingsqlserver
dbaplus Community
Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.