How We Transformed a Legacy System’s Performance: Real‑World Code‑Level Optimizations
This article walks through a real‑world project's performance evolution, detailing server specs, a litany of scalability and reliability problems, and concrete code‑level solutions such as database deadlock mitigation, transaction shortening, thread‑pool redesign, and logging improvements.
Server Environment
Server configuration: 4‑core CPU, 8 GB RAM, four machines. MQ: RabbitMQ. Database: DB2. SOA framework: internally built Dubbo. Cache frameworks: Redis and Memcached. Unified configuration management system: internally developed.
Problem Description
1. Single-node throughput is 40 TPS, and scaling to four nodes yields only 60 TPS, which shows poor scalability.
2. Frequent database deadlocks cause complete service outages.
3. Misuse of database transactions leads to excessively long transaction times.
4. Regular memory overflow and CPU saturation in production.
5. Poor fault tolerance; minor bugs often bring the service down.
6. Missing or useless log statements that provide no diagnostic value.
7. Repeated reads of static configuration from the database, generating heavy I/O.
8. Incomplete project isolation: multiple WARs deployed in a single Tomcat.
9. Platform bugs or feature defects reducing availability.
10. No rate limiting on APIs, allowing VIP merchants to stress-test the production environment.
11. Absence of fallback strategies, leading to long recovery times or brute-force rollbacks.
12. Lack of proper monitoring, preventing real-time detection of bottlenecks.
Optimization Solutions
1. Database Deadlock Mitigation
[Figure: example of a deadlock scenario]
Analysis shows that mixing FOR UPDATE with gap locks and next‑key locks easily creates deadlocks. The original design used pessimistic locking for deduplication, which overloaded the database and limited scalability. Three alternative solutions were adopted:
Use Redis for distributed locking with sharding; a single Redis failure does not halt the system.
Apply primary‑key based deduplication: insert attempts on duplicate orders trigger a unique‑key violation, which the application catches and returns.
Implement version‑based deduplication, ensuring each lock has an expiration time so resources are released when stale.
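The expiring-lock idea in the list above can be sketched in plain Java. This is a minimal in-memory stand-in, not the project's code: the class `ExpiringLock` and its method names are hypothetical, and the current time is passed in explicitly so the behavior is easy to check. In a Redis-backed setup, the command `SET key value NX PX milliseconds` provides the same "acquire only if absent, with a time-to-live" semantics.

```java
import java.util.concurrent.ConcurrentHashMap;

/**
 * In-memory stand-in for a deduplication lock with a time-to-live.
 * Hypothetical class: a real deployment would keep this state in Redis,
 * not inside a single JVM.
 */
class ExpiringLock {
    // lock key (for example an order id) -> expiry timestamp in milliseconds
    private final ConcurrentHashMap<String, Long> expiryByKey = new ConcurrentHashMap<>();

    /**
     * Tries to take the lock for ttlMillis, starting at nowMillis.
     * Returns true if the key was free or its previous lock had expired.
     */
    boolean tryLock(String key, long ttlMillis, long nowMillis) {
        boolean[] acquired = {false};
        // ConcurrentHashMap.compute runs atomically for the given key.
        expiryByKey.compute(key, (k, expiry) -> {
            if (expiry == null || expiry <= nowMillis) {
                acquired[0] = true;
                return nowMillis + ttlMillis; // take or renew the lock
            }
            return expiry; // still held by an earlier request
        });
        return acquired[0];
    }

    /** Releases the lock early, for example after the order has been processed. */
    void release(String key) {
        expiryByKey.remove(key);
    }
}
```

A duplicate request that arrives while the lock is live gets `false` and can return immediately without touching the database. A stale lock frees itself once its expiry time has passed.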
2. Reducing Transaction Duration
Pseudo‑code example:
public void test() {
    Transaction.begin(); // start transaction
    try {
        dao.insert(); // insert a row
        httpClient.queryRemoteResult(); // remote call made while the transaction is still open
        dao.update(); // update a row
        Transaction.commit(); // commit
    } catch (Exception e) {
        Transaction.rollback(); // rollback
    }
}
Mixing remote calls (e.g., httpClient) inside a transaction prolongs the transaction and harms concurrency, because locks and a database connection are held for the full duration of the remote call. The guideline is "fast-in, fast-out": keep transaction code minimal and move external calls outside it.
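One way to apply "fast-in, fast-out" to the pseudo-code above is to split the work into two short transactions, with the remote call between them. The sketch below is only an illustration under assumptions: `ShortTransactionSketch` and all of its methods are hypothetical stand-ins that record events in a list instead of talking to a database or a remote service.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-ins for the Transaction, dao and httpClient of the pseudo-code. */
class ShortTransactionSketch {
    // Records the order of operations so the transaction boundaries are visible.
    final List<String> events = new ArrayList<>();

    void begin()    { events.add("begin"); }
    void commit()   { events.add("commit"); }
    void rollback() { events.add("rollback"); }
    void insert()   { events.add("insert"); }
    void update(String remoteResult) { events.add("update:" + remoteResult); }
    String queryRemoteResult() { events.add("remote"); return "OK"; }

    /** Runs one unit of database work in its own short transaction. */
    void inTransaction(Runnable work) {
        begin();
        try {
            work.run();
            commit();
        } catch (RuntimeException e) {
            rollback();
            throw e; // surface the failure instead of swallowing it
        }
    }

    void process() {
        inTransaction(this::insert);           // transaction 1: insert only
        String result = queryRemoteResult();   // remote call, no transaction open
        inTransaction(() -> update(result));   // transaction 2: update only
    }
}
```

The trade-off is that the insert and the update are no longer atomic as a pair. If the remote call or the second transaction fails, the inserted row needs a status field, a retry, or a compensating step so it does not stay half-processed.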
3. CPU Consumption Analysis
During load testing, CPU usage remained high. Investigation revealed two main culprits:
Database connection pool: Using C3P0 under high concurrency caused connection checkout timeouts.
Thread-pool misuse: An unbounded Executors.newCachedThreadPool() created millions of threads, exhausting resources.
Replacing the cached pool with a fixed pool mitigated the thread explosion:
private static final ExecutorService executorService = Executors.newFixedThreadPool(50);
However, a fixed pool with an unbounded queue still led to object buildup under extreme load. The final solution consists of two alternatives:
Externalize all asynchronous tasks to a dedicated task processor, with the main application receiving callbacks and a scheduled job re‑dispatching timed‑out tasks.
Adopt the Akka framework for actor‑based concurrency (see previous performance test report).
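For context on the unbounded-queue problem mentioned above: Executors.newFixedThreadPool backs the pool with an unbounded LinkedBlockingQueue, so tasks submitted faster than they complete pile up in memory. The sketch below is not one of the two alternatives the project adopted. It only illustrates, under assumed sizes, how a ThreadPoolExecutor with a bounded queue rejects excess work instead of accumulating it. `BoundedPoolSketch` and its methods are hypothetical names.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class BoundedPoolSketch {

    /** Builds a pool with a fixed thread count and a bounded work queue. */
    static ThreadPoolExecutor newBoundedPool(int threads, int queueCapacity) {
        return new ThreadPoolExecutor(
                threads, threads,                                // core == max: fixed thread count
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<Runnable>(queueCapacity), // bounded queue
                new ThreadPoolExecutor.AbortPolicy());           // reject when threads and queue are full
    }

    /** Submits `tasks` tasks that all block, and returns how many were rejected. */
    static int countRejected(int threads, int queueCapacity, int tasks) {
        ThreadPoolExecutor pool = newBoundedPool(threads, queueCapacity);
        CountDownLatch release = new CountDownLatch(1);
        Runnable blockingTask = () -> {
            try {
                release.await(); // simulate slow work until released
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        int rejected = 0;
        for (int i = 0; i < tasks; i++) {
            try {
                pool.execute(blockingTask);
            } catch (RejectedExecutionException e) {
                rejected++; // caller can shed load, retry later, or hand off elsewhere
            }
        }
        release.countDown(); // let the accepted tasks finish
        pool.shutdown();
        return rejected;
    }
}
```

With 2 threads and a queue of 3, only 5 of 10 simultaneous slow tasks are accepted. The rejection gives the caller an explicit signal to apply back-pressure, which an unbounded queue never does.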
4. Logging Improvements
Problematic logging example:
QuataDTO quataDTO = null;
try {
    quataDTO = getRiskLimit(...);
} catch (Exception e) {
    logger.info("Failed to get risk control limit", e); // exception logged at info level
}
Best practices:
Log at error or warn level for exceptions.
Use a structured format: [system] - [error description] - [key info], including method name, error code, and message.
Avoid logging only e.getMessage(); include the stack trace.
Correct logging pattern example:
logger.warn("[innersys] - [" + exceptionType.description + "] - [" + methodName + "] - errorCode:[" + errorCode + "], errorMsg:[" + errorMsg + "]", e);
logger.info("[innersys] - [input parameters] - [" + methodName + "] - " + LogInfoEncryptUtil.getLogString(arguments));
logger.info("[innersys] - [return value] - [" + methodName + "] - " + LogInfoEncryptUtil.getLogString(result));
Excessive logging also caused thread blocking due to Log4j's synchronized I/O. Changing the pattern to a less synchronized format reduced contention and improved throughput, as shown by the subsequent performance graphs.
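The structured format from the best practices above is easier to keep consistent when it is built in one place rather than concatenated at every call site. The helper below is a small sketch with hypothetical names (`LogMessages`, `error`). It only builds the message string and makes no assumption about the logging framework in use.

```java
/** Hypothetical helper that builds log messages in the structured format. */
class LogMessages {

    /** Builds "[system] - [description] - [method] - errorCode:[..], errorMsg:[..]". */
    static String error(String system, String description, String method,
                        String errorCode, String errorMsg) {
        return String.format("[%s] - [%s] - [%s] - errorCode:[%s], errorMsg:[%s]",
                system, description, method, errorCode, errorMsg);
    }
}
```

A call site would then pass the result to the logger together with the exception so the stack trace is kept, for example `logger.warn(LogMessages.error("innersys", "risk limit lookup failed", "getRiskLimit", "E1001", "timeout"), e);`. The description, error code and message here are made-up values.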
Conclusion: The next article will continue the discussion on code‑level performance evolution.
21CTO
