Why Your Spring Cloud Microservices Stall at High Traffic and How to Fix It
This article examines a real‑world Spring Cloud microservice deployment that performed well with low traffic but suffered severe latency and hangs as user volume grew, analyzes root causes such as massive tables, complex SQL, and misconfigured timeouts, and provides step‑by‑step tuning, retry, and idempotency strategies to restore reliable performance.
1. Introduction
Many developers build microservice architectures with Spring Cloud. This works well for low‑traffic internal systems, but problems surface once the system has to handle tens of thousands of concurrent requests.
2. Scenario and Emerging Issues
A startup adopted Spring Cloud from the start. After initial development, the system handled a few hundred thousand registered users and a few thousand daily active users without noticeable issues.
Data accumulated in a single table grew to several million rows, and some services executed complex multi‑table SQL queries without proper indexing. Users began experiencing several‑second page hangs.
This was the situation at a friend's internet startup: after months of development the Spring Cloud system seemed stable, but performance degraded as data and traffic grew. The database problems were:
Large single‑table data (millions of rows).
Complex SQL with many joins.
Missing or poorly designed indexes, leading to multi‑second query execution.
Calls between services go through Feign and Ribbon, both of which enforce configurable timeouts. When a slow query pushes a call past the timeout, the caller receives an exception instead of data, and the page never renders.
3. Quick Fixes (Increasing Timeouts)
Faced with slow pages, many teams simply increase the timeout values in Feign, Ribbon, and Hystrix, hoping the request will eventually return.
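Spring Cloud typically runs remote calls on Hystrix thread pools, so timeouts must be set in two places: Feign/Ribbon and Hystrix. The Hystrix timeout should be the larger of the two; otherwise Hystrix cuts the call off before Ribbon's own timeout and retries can take effect.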
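A rough way to check that the Hystrix timeout really is the larger one is to compute the worst case Ribbon can spend on a single logical call, retries included. The sketch below uses a commonly cited rule of thumb based on `ribbon.ConnectTimeout`, `ribbon.ReadTimeout`, `ribbon.MaxAutoRetries`, and `ribbon.MaxAutoRetriesNextServer`. The class name and all numeric values are illustrative assumptions, not settings from the system described here.

```java
// Sketch: estimate the minimum Hystrix timeout implied by Ribbon settings.
// All values are illustrative assumptions, not taken from the article.
final class TimeoutBudget {

    /** Worst-case time Ribbon may spend on one logical call, retries included. */
    static long ribbonWorstCaseMillis(long connectTimeoutMs, long readTimeoutMs,
                                      int maxAutoRetries, int maxAutoRetriesNextServer) {
        return (connectTimeoutMs + readTimeoutMs)
                * (maxAutoRetries + 1)
                * (maxAutoRetriesNextServer + 1);
    }

    /** True when the Hystrix timeout leaves room for every Ribbon attempt. */
    static boolean hystrixCoversRibbon(long hystrixTimeoutMs, long ribbonWorstCaseMs) {
        return hystrixTimeoutMs > ribbonWorstCaseMs;
    }

    public static void main(String[] args) {
        long ribbon = ribbonWorstCaseMillis(1000, 1000, 1, 1);
        System.out.println("Ribbon worst case: " + ribbon + " ms");
        System.out.println("Hystrix at 5000 ms covers it: " + hystrixCoversRibbon(5000, ribbon));
    }
}
```

If the check fails, Hystrix may time out while Ribbon is still retrying, so the retries configured on the Ribbon side never get a chance to complete.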
After adjusting these parameters, the system appeared to work: pages loaded after a few seconds, but concurrency remained low (only a few dozen requests per second).
4. Problem Escalation
As the company secured funding and rapidly grew its user base, daily active users surged to millions and peak concurrency approached ten thousand requests per second.
Database read/write splitting and master‑slave replication were added, but during peak periods entire service pages would hang, with all threads in the Hystrix pool blocked for seconds.
The root cause: a limited thread pool (dozens of threads) handling calls to a downstream service; each call blocked for ~5 seconds, exhausting the pool.
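The arithmetic behind this exhaustion is simple: a pool of N threads, each held for T seconds per call, can complete at most N / T calls per second. The sketch below works through it with an assumed pool of 20 threads (the article only says "dozens"); the class and method names are hypothetical.

```java
// Sketch: throughput ceiling of a fixed thread pool whose threads block per call.
// The pool size of 20 is an assumption; the article only says "dozens of threads".
final class PoolCapacity {

    /** Upper bound on completed calls per second for a blocking thread pool. */
    static double maxRequestsPerSecond(int threads, long latencyMillis) {
        if (threads <= 0 || latencyMillis <= 0) {
            throw new IllegalArgumentException("threads and latency must be positive");
        }
        return threads * 1000.0 / latencyMillis;
    }

    public static void main(String[] args) {
        // Each call blocks for about 5 seconds, as in the incident described above.
        System.out.println("5 s per call:   " + maxRequestsPerSecond(20, 5000) + " req/s");
        // After optimization, calls take tens of milliseconds.
        System.out.println("50 ms per call: " + maxRequestsPerSecond(20, 50) + " req/s");
    }
}
```

With calls blocking for about 5 seconds, the same pool saturates at a handful of requests per second, so any traffic beyond that queues up or is rejected.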
5. Root Cause Analysis and Fundamental Solutions
Step 1: Optimize the core service (Service B) by simplifying database access—use single‑table queries, avoid large joins, and ensure proper indexes. This reduced response time from seconds to tens of milliseconds.
Step 2: Set realistic timeout values (generally ≤ 1 second). Longer timeouts mask performance problems and cause thread‑pool exhaustion.
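To illustrate why a short timeout protects the pool, here is a minimal fail‑fast sketch in plain Java. It is not how Hystrix is implemented; it only mimics the behavior with `Future.get` and a timeout, and the class name, fallback value, and durations are assumptions chosen for the example.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch: give up on a slow call after a fixed budget and return a fallback.
final class FailFastCall {

    static String callWithTimeout(ExecutorService pool, Callable<String> remoteCall,
                                  long timeoutMs, String fallback) {
        Future<String> future = pool.submit(remoteCall);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker so its thread is freed
            return fallback;
        } catch (ExecutionException e) {
            return fallback;     // the call itself failed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback;
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        String result = callWithTimeout(pool, () -> {
            Thread.sleep(5000); // simulated slow downstream service
            return "data";
        }, 1000, "fallback");
        System.out.println(result);
        pool.shutdownNow();
    }
}
```

The slow call costs the caller one second instead of five, so each thread can serve roughly five times as many requests during an outage.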
Step 3: Configure retry logic with reasonable limits, ensuring that failed calls are retried on another instance before giving up.
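The retry behavior can be sketched in plain Java as follows. In a Spring Cloud deployment Ribbon performs this on the caller's behalf; the code below is only an illustration of the idea, and the class name, instance names, and attempt limit are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Sketch: try a call on successive instances, with a hard cap on attempts.
final class RetryOnNextInstance {

    static <T> T call(List<String> instances, int maxAttempts, Function<String, T> request) {
        if (instances.isEmpty() || maxAttempts < 1) {
            throw new IllegalArgumentException("need at least one instance and one attempt");
        }
        RuntimeException last = null;
        int attempts = Math.min(maxAttempts, instances.size());
        for (int i = 0; i < attempts; i++) {
            try {
                return request.apply(instances.get(i)); // each attempt uses a different instance
            } catch (RuntimeException e) {
                last = e; // remember the failure and move on to the next instance
            }
        }
        throw new IllegalStateException("all " + attempts + " attempts failed", last);
    }

    public static void main(String[] args) {
        List<String> instances = Arrays.asList("service-b-1", "service-b-2");
        String result = call(instances, 2, host -> {
            if (host.equals("service-b-1")) {
                throw new RuntimeException("timeout"); // simulated failure on the first instance
            }
            return "ok from " + host;
        });
        System.out.println(result);
    }
}
```

Capping the attempts matters: every retry holds a thread for up to one more timeout period, so unlimited retries would recreate the exhaustion problem.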
Step 4: Guarantee idempotency for any retried operation, e.g., by using unique database indexes or Redis‑based unique identifiers, to prevent duplicate inserts.
Create a unique index in the database to reject duplicate rows.
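Use Redis to record a unique request ID with an atomic set‑if‑absent operation (for example SET with the NX option), and skip the insert when the ID already exists. A separate check followed by a write is not enough, because two retries can both pass the check before either writes.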
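Both options rely on the same idea: an atomic "insert if absent" keyed by a unique request ID. The sketch below imitates that with an in‑memory concurrent set standing in for the unique index or the Redis key. The class name, method names, and the request ID format are hypothetical, and a real service would use the database or Redis rather than process memory.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: deduplicate retried writes by request ID.
// The in-memory set is a stand-in for a unique index or a Redis key.
final class IdempotentInsert {

    private final Set<String> seenRequestIds = ConcurrentHashMap.newKeySet();
    private final AtomicInteger insertedRows = new AtomicInteger();

    /** Returns true if the row was inserted, false if the request was a duplicate. */
    boolean insert(String requestId) {
        // Set.add is atomic here: only the first caller with a given ID gets true.
        if (!seenRequestIds.add(requestId)) {
            return false;
        }
        insertedRows.incrementAndGet(); // simulated database insert
        return true;
    }

    int insertedRows() {
        return insertedRows.get();
    }

    public static void main(String[] args) {
        IdempotentInsert service = new IdempotentInsert();
        System.out.println(service.insert("req-42")); // first attempt
        System.out.println(service.insert("req-42")); // retry of the same request
        System.out.println(service.insertedRows());
    }
}
```

The caller generates the request ID once and reuses it for every retry, which is what lets the receiving service recognize the duplicate.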
6. Conclusion
Proper performance tuning—optimizing SQL, setting appropriate timeouts, adding retries, and ensuring idempotent operations—transforms a flaky Spring Cloud system into a stable, responsive service even under high concurrency.
Programmer DD
A tinkering programmer and author of "Spring Cloud Microservices in Action"
