
Lessons from QMQ: Network and Disk I/O Problems and Their Mitigations

The article analyzes real‑world network and disk I/O issues encountered in Qunar Message Queue (QMQ), explains root causes such as Netty OOM, file‑handle exhaustion, TCP timeout handling, and large‑traffic bursts, and presents practical mitigation strategies for backend systems.

Ctrip Technology

QMQ (Qunar Message Queue) was originally built on MySQL storage and later migrated to a file‑based distributed architecture to handle growing message volumes. The article shares practical experiences from Ctrip’s deployment, focusing on two main problem domains: network and disk I/O.

1. Network Issues

1.1 OOM – An out‑of‑memory alarm on a broker slave was caused by off‑heap memory leakage during Netty message reception. The lack of back‑pressure and unchecked auto‑read led to continuous queuing and off‑heap growth. Conclusion: check channel.isWritable() before Netty writes.
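The writability check recommended above can be sketched with plain Java: a channel with a bounded outbound buffer whose `isWritable()` is consulted before every write, so rejected writes never accumulate off-heap. This models Netty's `channel.isWritable()` high-water-mark behavior without the Netty dependency; `BoundedChannel` and `HIGH_WATER_MARK` are illustrative names, not QMQ's.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Back-pressure sketch: consult isWritable() before writing, and reject
// writes when the outbound buffer is at its high-water mark, instead of
// queuing without bound (which is what leaked off-heap memory in QMQ).
final class BoundedChannel {
    static final int HIGH_WATER_MARK = 4; // pending messages allowed before writes stop
    private final Deque<String> outbound = new ArrayDeque<>();

    boolean isWritable() {
        return outbound.size() < HIGH_WATER_MARK;
    }

    /** Returns true if the message was queued, false if back-pressure rejected it. */
    boolean write(String msg) {
        if (!isWritable()) {
            return false; // caller retries later rather than growing buffers
        }
        outbound.add(msg);
        return true;
    }

    /** Simulates the I/O thread flushing one pending message. */
    String flushOne() {
        return outbound.poll();
    }
}
```

In real Netty code the same idea also applies inbound: disabling auto-read (or pausing reads) when the process cannot keep up, so the kernel's TCP window throttles the sender.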

1.2 File‑handle Exhaustion – TCP connections to MetaServer failed because the process ran out of file descriptors (limit 65536). Missing idle detection caused leaked connections. Conclusion: implement bidirectional idle detection.
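Netty users would normally reach for `IdleStateHandler` here; the sketch below shows the underlying bookkeeping in plain Java, assuming a periodic checker. Each connection records when it last read and last wrote, and is closed (releasing its file descriptor) only when both directions have been silent past a threshold. `ConnectionState` is an illustrative name.

```java
// Bidirectional idle detection sketch: track last-read and last-write times
// per connection, and close connections that are idle in BOTH directions,
// so leaked connections cannot hold file descriptors forever.
final class ConnectionState {
    private long lastReadMillis;
    private long lastWriteMillis;
    private boolean open = true;

    ConnectionState(long now) {
        lastReadMillis = now;
        lastWriteMillis = now;
    }

    void onRead(long now)  { lastReadMillis = now; }
    void onWrite(long now) { lastWriteMillis = now; }
    boolean isOpen()       { return open; }

    /** Close if neither direction has seen traffic within idleTimeoutMillis. */
    void checkIdle(long now, long idleTimeoutMillis) {
        boolean readIdle  = now - lastReadMillis  > idleTimeoutMillis;
        boolean writeIdle = now - lastWriteMillis > idleTimeoutMillis;
        if (readIdle && writeIdle) {
            open = false; // real code would also close the socket / release the fd
        }
    }
}
```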

1.3 Broker Not Removed – When a broker became unreachable, the heartbeat mechanism failed to mark it as non‑read/write, causing routing errors. A redesign added periodic DB scans by all MetaServers to mark lost brokers. Conclusion: consider network partition scenarios in distributed designs.
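The redesigned liveness check can be sketched as follows: every MetaServer periodically scans the shared heartbeat table and marks brokers whose last heartbeat is too old as unavailable, so routing survives even when one MetaServer is partitioned away from a broker. The class and method names here are illustrative, not QMQ's actual API.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of the DB-scan liveness check: mark brokers non-read/write when
// their last recorded heartbeat is older than the timeout, and restore them
// as soon as a fresh heartbeat arrives.
final class BrokerLivenessScanner {
    private final Map<String, Long> lastHeartbeatMillis = new HashMap<>();
    private final Set<String> unavailable = new HashSet<>();

    void recordHeartbeat(String broker, long now) {
        lastHeartbeatMillis.put(broker, now);
        unavailable.remove(broker); // broker came back
    }

    /** The periodic scan every MetaServer runs against the shared table. */
    void scan(long now, long heartbeatTimeoutMillis) {
        for (Map.Entry<String, Long> e : lastHeartbeatMillis.entrySet()) {
            if (now - e.getValue() > heartbeatTimeoutMillis) {
                unavailable.add(e.getKey()); // routed around until it heartbeats again
            }
        }
    }

    boolean isAvailable(String broker) {
        return lastHeartbeatMillis.containsKey(broker) && !unavailable.contains(broker);
    }
}
```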

1.4 java.net.SocketTimeoutException – After a network outage, threads blocked on MySQL reads only failed after ~15 minutes, because with no SO_TIMEOUT configured the reads hung until Linux's TCP retransmission timers gave up. Conclusion: configure SO_TIMEOUT on DataSources.
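The effect of SO_TIMEOUT is easy to demonstrate with plain sockets: a read on a connection whose peer never responds blocks indefinitely without it, but fails fast with `SocketTimeoutException` once it is set. For MySQL Connector/J the corresponding knob is the `socketTimeout` connection property (for example `jdbc:mysql://...?socketTimeout=3000`); the demo class below is illustrative.

```java
import java.io.InputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// SO_TIMEOUT demo: connect to a "silent" local peer that accepts but never
// writes, then attempt a blocking read. With setSoTimeout, the read raises
// SocketTimeoutException quickly instead of hanging on TCP retransmission.
final class SoTimeoutDemo {
    /** Returns true if the blocked read timed out rather than hanging. */
    static boolean readTimesOut(int timeoutMillis) throws Exception {
        try (ServerSocket server = new ServerSocket(0); // silent peer
             Socket client = new Socket("127.0.0.1", server.getLocalPort());
             Socket accepted = server.accept()) {
            client.setSoTimeout(timeoutMillis); // SO_TIMEOUT on the reading side
            InputStream in = client.getInputStream();
            try {
                in.read(); // would block indefinitely without the timeout
                return false;
            } catch (SocketTimeoutException expected) {
                return true;
            }
        }
    }
}
```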

1.5 Large Traffic Burst – Sudden spikes caused full GC and OOM because Netty’s decode handler placed messages into an unbounded receive queue, delaying off‑heap reclamation. Mitigations included request‑size checks, rate limiting, bounded queues with timeout discard, and I/O latency monitoring. Conclusion: implement back‑pressure mechanisms.
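The "bounded queue with timeout discard" mitigation can be sketched with a fixed-capacity `ArrayBlockingQueue`: the decode handler offers with a short timeout and, on failure, discards the request (freeing its buffer and letting the client retry) instead of queuing without bound. The class name and the busy-reply detail are illustrative.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Bounded receive queue sketch: under a traffic burst, offers that cannot
// be accepted within the timeout are discarded rather than accumulated,
// so off-heap buffers are reclaimed promptly instead of piling up to OOM.
final class BoundedReceiveQueue {
    private final BlockingQueue<byte[]> queue;
    private long discarded;

    BoundedReceiveQueue(int capacity) {
        queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Returns true if enqueued; false if the request was discarded under load. */
    boolean enqueue(byte[] request, long timeoutMillis) throws InterruptedException {
        if (queue.offer(request, timeoutMillis, TimeUnit.MILLISECONDS)) {
            return true;
        }
        discarded++; // real handler: release the buffer, answer "busy" to the client
        return false;
    }

    byte[] take() throws InterruptedException { return queue.take(); }
    long discardedCount() { return discarded; }
}
```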

2. Disk I/O Issues

2.1 Accumulated Message Pulls – Under the shared log-file model, pulls of long-accumulated messages degrade into many random reads, driving up disk I/O utilization. Sorting message files and separating hot and cold data (e.g., mirroring cold data to HBase) were suggested. Conclusion: consider hot‑cold separation.
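One way to sketch hot/cold separation is an age-based router: recent messages are served from the sequential local log, while pulls of old, accumulated messages are redirected to a mirror store so their random reads stop competing with the hot path. The retention threshold and store names below are illustrative assumptions, not QMQ's actual design.

```java
// Hot/cold routing sketch: pulls for messages newer than the hot-retention
// window go to the local log; older pulls are redirected to a cold mirror
// (e.g. an HBase copy), keeping random reads off the hot disk path.
enum Store { HOT_LOCAL_LOG, COLD_MIRROR }

final class PullRouter {
    private final long hotRetentionMillis;

    PullRouter(long hotRetentionMillis) {
        this.hotRetentionMillis = hotRetentionMillis;
    }

    Store routeFor(long messageTimestampMillis, long now) {
        long age = now - messageTimestampMillis;
        return age <= hotRetentionMillis ? Store.HOT_LOCAL_LOG : Store.COLD_MIRROR;
    }
}
```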

2.2 Large Messages – Some topics contain messages >100 KB. Enabling producer‑side compression achieved 5‑8× size reduction, reducing disk write volume and I/O pressure. Conclusion: compress large payloads and optimise file encoding.
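Producer-side compression can be sketched with the JDK's built-in gzip: compress payloads above a size threshold before handing them to the send path. Large message bodies tend to be repetitive (JSON/XML), which is where reductions like the article's 5‑8× come from; the threshold and class name here are illustrative.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

// Producer-side compression sketch: gzip payloads over a threshold; small
// messages are passed through unchanged since the CPU cost is not worth it.
final class PayloadCompressor {
    static final int COMPRESS_THRESHOLD_BYTES = 100 * 1024; // ~100 KB, per the article

    static byte[] maybeCompress(byte[] payload) throws IOException {
        if (payload.length < COMPRESS_THRESHOLD_BYTES) {
            return payload; // pass small messages through uncompressed
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(payload);
        }
        return out.toByteArray();
    }
}
```

A real producer would also flag compressed messages (e.g. in a header) so consumers know to decompress.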

Finally, the article notes additional real‑world complications (packet loss, TCP retransmission failures, RAID issues, etc.) and outlines future work such as file‑encoding optimisation, page‑cache tuning, consumer pull redirection, and kernel upgrades.

Tags: backend, Performance Tuning, Message Queue, disk-io, Network IO, QMQ
Written by Ctrip Technology, the official Ctrip Technology account.