Evolution and Performance Optimization of Ximalaya’s HTTP Gateway: From Tomcat NIO to Netty Full‑Async Architecture
This article describes how Ximalaya’s high‑traffic HTTP gateway evolved from a Tomcat NIO + AsyncServlet design to a Netty‑based fully asynchronous architecture. It covers the problems of the first design (blocking I/O, memory copying, and GC pressure) and explains how a layered redesign, lock‑free connection pools, comprehensive monitoring, and performance optimizations let the gateway stably handle more than 200 billion calls per day, with peak QPS above 40 k per machine.
Gateways are mature middleware that many internet companies use to decouple the rollout of shared features from individual services. Ximalaya’s gateway processes more than 200 billion calls daily, with single‑machine QPS peaks above 40 k.
First version: Tomcat NIO + AsyncServlet. Backend calls had to be handled asynchronously so that they would not block Tomcat worker threads. A dedicated Push layer used HttpNioClient for features such as blacklists and whitelists, flow control, authentication, and API publishing. However, Tomcat’s object pooling, memory copying (three copies), and blocking request‑body reads caused severe GC pressure, with full GCs appearing at about 5 k QPS.
Tomcat‑specific problems identified:
Excessive object caching leading to frequent GC.
Heap‑to‑off‑heap memory copy when interacting with Netty services.
Blocking request‑body reads.
HttpNioClient also suffered from lock contention on connection acquire/release, which limited performance under high concurrency.
Second version – Netty + full asynchronous design – replaced Tomcat with a lock‑free, layered architecture, eliminating the above bottlenecks.
Access layer: Netty I/O threads perform HTTP encoding and decoding, monitor protocol‑level anomalies, enforce request‑line and header size limits, and immediately return 400 for oversized requests.
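The size limits above can be sketched in plain Java. This is a minimal illustration, not the gateway's code: the class name RequestSizeGuard and the limit values are hypothetical, and a real Netty deployment would normally rely on the HTTP codec's own limits rather than a hand-written check.

```java
/** Hypothetical helper: checks a raw request head against size limits. */
final class RequestSizeGuard {
    private final int maxRequestLineLength;
    private final int maxHeaderLength;

    RequestSizeGuard(int maxRequestLineLength, int maxHeaderLength) {
        this.maxRequestLineLength = maxRequestLineLength;
        this.maxHeaderLength = maxHeaderLength;
    }

    /**
     * Returns true when the head (request line plus headers, ASCII assumed)
     * is within limits. The caller would answer 400 when this returns false.
     */
    boolean withinLimits(String rawHead) {
        int lineEnd = rawHead.indexOf("\r\n");
        if (lineEnd < 0) {
            return false; // no complete request line
        }
        if (lineEnd > maxRequestLineLength) {
            return false; // request line too long
        }
        // Everything after the request line's CRLF counts as header data.
        int headerLength = rawHead.length() - (lineEnd + 2);
        return headerLength <= maxHeaderLength;
    }
}
```

Rejecting on the head alone means an oversized request is refused before any body is read or any business logic runs.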
Business‑logic layer implements a chain of responsibility that handles user authentication, blacklists and whitelists (at global, application, IP, and parameter level), token‑bucket flow control, smart circuit breaking with automatic downgrade, gray release with slow start, unified downgrade rules down to parameter level, traffic scheduling and copying, and sampled logging of failed requests.
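Token‑bucket flow control, one of the handlers listed above, can be sketched as follows. The article does not show the gateway's implementation, so the class name TokenBucket, the constructor parameters, and the injected clock are assumptions made for illustration. This version uses a synchronized method for simplicity rather than a lock‑free design.

```java
import java.util.function.LongSupplier;

/** Hypothetical token bucket: holds up to `capacity` tokens, refilled at a fixed rate. */
final class TokenBucket {
    private final long capacity;
    private final long refillPerSecond;
    private final LongSupplier nanoClock; // injected so behaviour is testable
    private double tokens;
    private long lastRefillNanos;

    TokenBucket(long capacity, long refillPerSecond, LongSupplier nanoClock) {
        this.capacity = capacity;
        this.refillPerSecond = refillPerSecond;
        this.nanoClock = nanoClock;
        this.tokens = capacity; // start full so an initial burst is allowed
        this.lastRefillNanos = nanoClock.getAsLong();
    }

    /** Takes one token if available; returns false when the request should be rejected. */
    synchronized boolean tryAcquire() {
        long now = nanoClock.getAsLong();
        // Add the tokens earned since the last call, capped at the bucket capacity.
        double earned = (now - lastRefillNanos) / 1_000_000_000.0 * refillPerSecond;
        tokens = Math.min(capacity, tokens + earned);
        lastRefillNanos = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

In production the clock would be System::nanoTime. The capacity bounds the burst size, while the refill rate bounds the sustained request rate.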
Service‑call layer performs asynchronous remote calls using Netty’s connection pool, managing connections with lock‑free acquisition and release and handling Connection: close, idle timeout, read/write timeout, and FIN/RESET scenarios.
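One way to acquire and release connections without locks is to combine a concurrent deque with a compare‑and‑set counter. The sketch below is illustrative only: LockFreePool, its methods, and the factory parameter are hypothetical names, and the article does not say that the gateway's pool is built this way.

```java
import java.util.concurrent.ConcurrentLinkedDeque;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical lock-free pool: idle connections in a concurrent deque, size tracked by CAS. */
final class LockFreePool<C> {
    private final ConcurrentLinkedDeque<C> idle = new ConcurrentLinkedDeque<>();
    private final AtomicInteger total = new AtomicInteger();
    private final int maxConnections;
    private final Supplier<C> factory;

    LockFreePool(int maxConnections, Supplier<C> factory) {
        this.maxConnections = maxConnections;
        this.factory = factory;
    }

    /** Returns an idle or new connection, or null when the pool is exhausted. */
    C acquire() {
        C conn = idle.pollFirst();
        if (conn != null) {
            return conn;
        }
        while (true) {
            int current = total.get();
            if (current >= maxConnections) {
                return null; // caller may queue the request or fail fast
            }
            // Reserve a slot atomically before creating the connection.
            if (total.compareAndSet(current, current + 1)) {
                return factory.get();
            }
        }
    }

    /** Returns a healthy connection to the pool for reuse. */
    void release(C conn) {
        idle.offerFirst(conn);
    }

    /** Frees the slot of a connection that was closed (Connection: close, timeout, FIN/RESET). */
    void discard(C conn) {
        total.decrementAndGet(); // a real pool would also close conn here
    }
}
```

Threads that lose the compare‑and‑set race simply retry, so no thread ever parks on a lock during acquire or release.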
Full‑link timeout mechanism covers protocol parsing, queue wait, connection establishment, connection wait, pre‑write timeout check, write timeout, and response timeout.
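A common way to make stage timeouts add up to an end‑to‑end limit is to carry a single deadline with the request and derive each stage's timeout from the remaining budget. The article lists the stages but not the mechanism, so this is a sketch under that assumption. RequestDeadline and its methods are hypothetical names.

```java
import java.util.function.LongSupplier;

/** Hypothetical per-request deadline shared by every stage of the call. */
final class RequestDeadline {
    private final long deadlineNanos;
    private final LongSupplier nanoClock; // injected so behaviour is testable

    RequestDeadline(long budgetMillis, LongSupplier nanoClock) {
        this.nanoClock = nanoClock;
        this.deadlineNanos = nanoClock.getAsLong() + budgetMillis * 1_000_000L;
    }

    /** Milliseconds left in the overall budget, never negative. */
    long remainingMillis() {
        return Math.max(0L, (deadlineNanos - nanoClock.getAsLong()) / 1_000_000L);
    }

    /** True once the budget is spent, e.g. for a check just before writing to the backend. */
    boolean expired() {
        return nanoClock.getAsLong() - deadlineNanos >= 0;
    }

    /** Timeout to arm for one stage: its own limit, capped by what is left overall. */
    long stageTimeoutMillis(long stageLimitMillis) {
        return Math.min(stageLimitMillis, remainingMillis());
    }
}
```

With this shape, a request that has already waited too long in a queue fails the pre‑write check instead of consuming a backend connection.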
Monitoring & alarm includes protocol‑layer detection of attack‑style requests (header‑only, oversized body) and application‑layer metrics such as latency (tp99, tp999), QPS, bandwidth, response codes (especially 400/404), connection statistics, failure rates, and traffic jitter alerts.
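Latency metrics such as tp99 and tp999 are percentiles over a window of samples. The following sketch computes them with the nearest‑rank method. LatencyStats is a hypothetical name, and a production gateway would more likely use a streaming histogram than sort raw samples.

```java
import java.util.Arrays;

/** Hypothetical helper: nearest-rank percentiles over a window of latency samples. */
final class LatencyStats {
    /**
     * Returns the smallest sample such that at least perMille/1000 of all
     * samples are less than or equal to it (990 for tp99, 999 for tp999).
     */
    static long percentile(long[] latenciesMillis, int perMille) {
        if (latenciesMillis.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (perMille < 1 || perMille > 1000) {
            throw new IllegalArgumentException("perMille must be in 1..1000");
        }
        long[] sorted = latenciesMillis.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        // Integer ceiling of perMille * n / 1000 avoids floating-point rounding.
        long rank = ((long) perMille * sorted.length + 999) / 1000;
        return sorted[(int) rank - 1];
    }
}
```

Taking perMille as an integer keeps the rank calculation exact, which matters at the tail where tp999 differs from the maximum by a single sample.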
Performance optimizations involve object‑pool reuse, fewer context switches (changing some processing from asynchronous to synchronous cut CPU context switches by ~20 %), GC tuning (a large young generation, SurvivorRatio=2, and a maximum tenuring age of 15), and careful logging to avoid blocking Netty I/O threads. The following finalize method from AbstractPlainSocketImpl is a safety net against leaked sockets, and it also matters for GC: objects with a finalizer are reclaimed only after the finalizer has run, so they outlive at least one collection:
/**
 * Cleans up if the user forgets to close it.
 */
protected void finalize() throws IOException {
    close();
}
Log4j’s immediateFlush and bounded AsyncAppender buffers can block I/O threads under heavy logging, so log volume is minimized.
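To illustrate why a bounded log buffer can stall an I/O thread and how dropping avoids it, here is a minimal sketch. DroppingLogBuffer is a hypothetical class, not part of Log4j. It uses the non‑blocking offer call of a bounded queue, which returns false when the queue is full instead of waiting for space.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical bounded log buffer that drops lines instead of blocking the producer. */
final class DroppingLogBuffer {
    private final BlockingQueue<String> queue;
    private final AtomicLong dropped = new AtomicLong();

    DroppingLogBuffer(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Called from an I/O thread: never waits for space, drops the line when full. */
    boolean tryLog(String line) {
        if (queue.offer(line)) {
            return true;
        }
        dropped.incrementAndGet(); // count the loss so it can be monitored
        return false;
    }

    /** Called from a background writer thread; returns null when nothing is queued. */
    String poll() {
        return queue.poll();
    }

    long droppedCount() {
        return dropped.get();
    }
}
```

The trade‑off is explicit: under a logging burst some lines are lost and counted, but request latency on the I/O thread is unaffected.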
Future plans include migrating to HTTP/2 to multiplex multiple requests per connection, further refining monitoring and alarm accuracy, and enhancing downgrade strategies to ensure graceful degradation across the entire site.
Conclusion: The gateway has become a core, cloud‑native component at Ximalaya, with ongoing work on multi‑active deployment, stability platforms, and continuous performance improvements.
IT Architects Alliance