How to Optimize High‑Concurrency Services (QPS > 200k)
This article outlines practical strategies for handling online services with extremely high request rates—over 200,000 QPS—by avoiding relational databases, employing multi‑level caching, leveraging multithreading, implementing circuit‑breaker and downgrade mechanisms, optimizing I/O, controlling retries, handling edge cases, and logging efficiently.
1: Say No to Relational Databases
A large-scale consumer-facing (C-end) internet service should not rely on a relational database as its primary storage. Instead, use an in-memory NoSQL store such as Redis or Memcached as the main "database" and keep MySQL or Oracle only as an asynchronous backup.
Example: During JD.com's Double-11 event, product data is first written to Redis and later persisted to MySQL asynchronously. C-end queries read directly from Redis, while business-side (B-end) queries can still use the database.
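The write path in this example can be sketched as a write-behind store. This is a minimal illustration and not JD.com's implementation: a plain dict stands in for Redis, the `persist` callback stands in for a MySQL write, and the class and method names are hypothetical.

```python
import queue
import threading


class WriteBehindStore:
    """Serve reads and writes from a fast store; persist to a slow store asynchronously."""

    def __init__(self, persist):
        self.cache = {}          # stand-in for Redis (assumption)
        self.persist = persist   # stand-in for a MySQL write (assumption)
        self.pending = queue.Queue()
        threading.Thread(target=self._drain, daemon=True).start()

    def put(self, key, value):
        self.cache[key] = value          # visible to readers immediately
        self.pending.put((key, value))   # persisted later, off the request path

    def get(self, key):
        return self.cache.get(key)       # reads never touch the database

    def _drain(self):
        # Background worker: copy queued writes to the slow store one by one.
        while True:
            key, value = self.pending.get()
            try:
                self.persist(key, value)
            except Exception:
                pass  # a real system would retry or raise an alert here
            finally:
                self.pending.task_done()

    def flush(self):
        self.pending.join()  # block until every queued write has been handled
```

The trade-off is durability: anything still in the queue is lost if the process dies, so a production setup would typically put a durable message queue between the cache and the database.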
2: Multi‑Level Caching
Caching is essential for high concurrency. A single Redis node can handle roughly 60-80k QPS, but because Redis executes commands on a single thread, a hot key can saturate one node, so additional cache layers are needed in front of it.
Typical multi-level cache stack: local in-process cache → Memcached (multithreaded) → Redis. This hierarchy can absorb millions of QPS in flash-sale scenarios.
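A lookup through such a stack might look like the sketch below. It assumes each level exposes dict-like `get` and item assignment. Plain dicts stand in for the local cache, Memcached, and Redis, and the class name is hypothetical. Expiry (TTL) and invalidation, which a real stack needs, are left out.

```python
class MultiLevelCache:
    """Check the fastest level first and backfill faster levels on a hit."""

    def __init__(self, levels):
        # Ordered fastest first, e.g. [local, memcached, redis].
        self.levels = levels

    def get(self, key):
        for depth, level in enumerate(self.levels):
            value = level.get(key)
            if value is not None:
                # Copy the value into every faster level so the next
                # request for this key is served closer to the caller.
                for faster in self.levels[:depth]:
                    faster[key] = value
                return value
        return None  # miss at every level


local, memcached, redis = {}, {}, {"sku:1": "phone"}
cache = MultiLevelCache([local, memcached, redis])
first = cache.get("sku:1")   # served by the slowest level, then backfilled
second = cache.get("sku:1")  # now served by the local level
```

Backfilling is what lets the local layer absorb hot-key traffic: after the first miss, repeated reads for the same key no longer reach Redis.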
3: Multithreading
In one case, an interface looped synchronously over a 300-400k-element list and read Redis once per element (≈3 ms per call). Replacing the loop with a thread-pool implementation reduced response time from more than 30 s to about 3 s, demonstrating the power of multithreading on multi-core servers.
However, thread pool size and queue length must be tuned and monitored to avoid resource waste.
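The change from a sequential loop to a thread pool can be sketched as follows. `fetch_one` is a hypothetical stand-in for the per-item Redis read, and the default pool size is an arbitrary assumption that would need tuning against real measurements.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(keys, fetch_one, max_workers=32):
    """Run fetch_one for every key on a fixed-size thread pool.

    Results come back in the same order as keys.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_one, keys))
```

Note that `ThreadPoolExecutor` queues pending work without a bound, so for very large inputs it may be worth submitting keys in slices to keep queue length under control.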
4: Degradation and Circuit‑Breaker
Both mechanisms protect services from overload. Degradation disables non‑essential features while keeping the main flow alive; circuit‑breaker cuts off calls to an overloaded downstream service and returns failures immediately.
Choosing between them depends on business scenarios.
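Both ideas can be sketched in a few lines. The class below is a simplified circuit breaker, and `with_fallback` shows degradation by returning a default value when the downstream call is unavailable. All names, thresholds, and timeouts are illustrative assumptions, and the breaker is not thread-safe as written.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("downstream call skipped")
            # Timeout elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # Open on reaching the threshold, or re-open after a failed trial.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.opened_at = None  # success closes the circuit
        return result


def with_fallback(breaker, func, fallback):
    """Degradation: serve a fallback when the call fails or the circuit is open."""
    try:
        return breaker.call(func)
    except Exception:
        return fallback
```

While the circuit is open, `call` fails immediately without touching the downstream service, which gives that service room to recover.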
5: I/O Optimization
Frequent connection creation and teardown increase I/O load. Batch requests whenever possible to reduce the number of downstream calls, especially in high‑traffic product detail queries.
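Batching can be sketched as below, assuming the downstream service offers a batch endpoint. `fetch_batch` is a hypothetical callable that takes a list of product IDs and returns a dict keyed by ID; the batch size of 100 is an arbitrary assumption.

```python
def chunked(items, size):
    """Yield consecutive slices of items, each with at most size elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_products(product_ids, fetch_batch, batch_size=100):
    """Fetch product details with one downstream call per batch, not per ID."""
    results = {}
    for batch in chunked(product_ids, batch_size):
        results.update(fetch_batch(batch))
    return results
```

With 250 IDs and a batch size of 100, this issues 3 downstream calls instead of 250.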
6: Use Retries Wisely
Retry can mitigate transient failures but must be limited in count, spaced appropriately, and configurable; otherwise it can cause cascading failures (e.g., Kafka consumer lag caused by excessive retries).
Control retry count
Set proper retry intervals
Make retry behavior configurable
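The three rules above can be combined in one small helper. This is a sketch under assumed defaults: the attempt count, delays, and jitter range are illustrative, and in practice they would come from configuration.

```python
import random
import time


def retry(func, max_attempts=3, base_delay=0.1, max_delay=2.0, sleep=time.sleep):
    """Call func until it succeeds or max_attempts is reached.

    The wait doubles after each failure (capped at max_delay) and is
    randomized so that many clients do not retry at the same instant.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # retry budget exhausted: surface the error
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay * random.uniform(0.5, 1.0))
```

A real helper would usually retry only on errors known to be transient, such as timeouts, rather than on every exception.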
7: Guard Edge Cases and Provide Fallbacks
Missing checks for edge cases (e.g., empty arrays) can lead to massive data leaks affecting millions of users; simple validation can prevent catastrophic incidents.
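One way an empty array can cause this kind of incident is a filter that silently disappears when its ID list is empty, so the query matches every row. That scenario is an assumption for illustration, not a detail from the source, and the function names and `%s` placeholder style below are hypothetical.

```python
def build_id_filter(ids):
    """Return (sql_fragment, params) for an IN filter, or None if ids is empty."""
    if not ids:
        return None
    placeholders = ", ".join(["%s"] * len(ids))
    return f"id IN ({placeholders})", list(ids)


def find_rows(ids, run_query):
    """Look up rows by ID; an empty ID list yields an empty result."""
    id_filter = build_id_filter(ids)
    if id_filter is None:
        return []  # fallback: no query is issued, never an unfiltered one
    where, params = id_filter
    return run_query(where, params)
```

The guard makes the safe behavior explicit: empty input returns nothing instead of everything.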
8: Log Elegantly
Full‑volume logging at 200k QPS can consume terabytes of disk space daily and increase response latency. Use rate‑limited logging (token bucket) or whitelist‑based logging to reduce noise and resource consumption.
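Rate-limited logging with a token bucket can be sketched as a filter for Python's standard `logging` module. The class name, rate, and capacity are assumptions chosen for illustration.

```python
import logging
import threading
import time


class TokenBucketFilter(logging.Filter):
    """Let through at most `rate` records per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        super().__init__()
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.last = clock()
        self.lock = threading.Lock()

    def filter(self, record):
        with self.lock:
            now = self.clock()
            # Refill tokens for the time elapsed since the last record.
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True   # emit this record
            return False      # drop it: the bucket is empty
```

The filter is attached with `logger.addFilter(TokenBucketFilter(rate=100, capacity=200))`. Whitelist-based logging could be built the same way, with a filter that passes only records for selected user or request IDs.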
Conclusion
This article summarizes essential considerations for high-concurrency services, offering practical advice for maintaining reliability and performance while acknowledging that real-world scenarios can be more complex.
Cloud Native Technology Community
The Cloud Native Technology Community, part of the CNBPA Cloud Native Technology Practice Alliance, focuses on evangelizing cutting‑edge cloud‑native technologies and practical implementations. It shares in‑depth content, case studies, and event/meetup information on containers, Kubernetes, DevOps, Service Mesh, and other cloud‑native tech, along with updates from the CNBPA alliance.
