How We Boosted a Payment Service from 40TPS to 60TPS: Real-World Backend Optimizations
A case study of a card-less payment system that walks through common backend pitfalls (database deadlocks, long-running transactions, thread-pool misuse, excessive logging, and missing monitoring) and the practical fixes applied: Redis locks, tighter transaction scopes, bounded thread pools, and cheaper logging, all aimed at better scalability and reliability.
Introduction
Hello everyone, it's been a while since I discussed technology with you. Today I will share the performance evolution of the card‑less payment project I am responsible for.
Server Environment
Servers: 4‑core CPU, 8 GB RAM, 4 instances
MQ: RabbitMQ
Database: DB2
SOA framework: internally packaged Dubbo
Cache: Redis, Memcached
Configuration management: internal system
Problem Description
A single node handles 40 TPS, but scaling out to 4 nodes reaches only 60 TPS: four times the hardware for 1.5 times the throughput.
Frequent database deadlocks causing complete service outage.
Improper use of database transactions leading to long lock times.
Memory overflow and CPU saturation in production.
Poor fault tolerance – minor bugs can bring the service down.
Insufficient or useless logging.
Frequent reads of static configuration from the database, causing high I/O.
Multiple WAR packages deployed in a single Tomcat.
Platform bugs reducing availability.
No rate‑limiting, allowing VIP merchants to stress‑test production.
Lack of fallback strategies, leading to long recovery times or brute‑force rollbacks.
No proper monitoring to detect bottlenecks in real time.
Optimization Solutions
1. Database Deadlock Mitigation
In a typical deadlock, sessions A and B each hold a lock that the other is waiting for, so neither can proceed. This is often caused by mixing FOR UPDATE with gap locks and next-key locks.
Instead of pessimistic locking, we adopted three approaches:
Use Redis distributed locks with sharding; if a node fails, others take over.
Apply primary‑key based deduplication: duplicate inserts raise a unique‑constraint error.
Introduce version-number-based optimistic locking. The locks above all carry expiration times, so a holder that crashes cannot block others indefinitely.
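The deduplication and optimistic-locking ideas above can be sketched as follows. This is a minimal illustration, not the project's actual code: an in-memory map stands in for the database table, and `VersionedRow`, `VersionedStore`, and their methods are hypothetical names. Against a real database the same check is usually an `UPDATE ... SET version = version + 1 WHERE id = ? AND version = ?`, where an affected-row count of zero means another writer got there first.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical row: a balance guarded by a version number.
final class VersionedRow {
    final long balance;
    final int version;

    VersionedRow(long balance, int version) {
        this.balance = balance;
        this.version = version;
    }
}

// In-memory stand-in for a table (an assumption made so the sketch runs without a database).
final class VersionedStore {
    private final ConcurrentMap<String, VersionedRow> rows = new ConcurrentHashMap<>();

    // Mirrors an INSERT on a primary key: a duplicate key is rejected instead of locked on.
    boolean insert(String id, long balance) {
        return rows.putIfAbsent(id, new VersionedRow(balance, 0)) == null;
    }

    VersionedRow read(String id) {
        return rows.get(id);
    }

    // Mirrors: UPDATE t SET balance = ?, version = version + 1 WHERE id = ? AND version = ?
    // Returns false when the row changed since the caller read it.
    boolean update(String id, long newBalance, int expectedVersion) {
        VersionedRow current = rows.get(id);
        if (current == null || current.version != expectedVersion) {
            return false; // stale read: another writer updated first
        }
        VersionedRow next = new VersionedRow(newBalance, expectedVersion + 1);
        // Atomic swap: succeeds only if the row is still the one we read.
        return rows.replace(id, current, next);
    }
}
```

A caller whose `update` returns false re-reads the row and retries or reports a conflict; no lock is held while it decides.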
2. Reducing Transaction Duration
Problematic code example:
public void test() {
    Transaction.begin(); // start transaction
    try {
        dao.insert();
        httpClient.queryRemoteResult(); // remote call inside transaction: locks are held while waiting on the network
        dao.update();
        Transaction.commit();
    } catch (Exception e) {
        Transaction.rollback();
    }
}
Long-running remote calls inside a transaction inflate lock time and hurt concurrency. The principle is to keep transactions short and move non-essential work outside.
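One way to apply that principle is to split the work into two short transactions with the remote call between them. The sketch below is an assumption about how this could look, not the project's code: `Tx`, `OrderDao`, `RemoteClient`, and `PaymentFlow` are hypothetical names that mirror the snippet above.

```java
// Hypothetical collaborators; they mirror the names in the snippet above, not a real library.
interface Tx {
    void begin();
    void commit();
    void rollback();
}

interface OrderDao {
    void insert();
    void update(String remoteResult);
}

interface RemoteClient {
    String queryRemoteResult();
}

final class PaymentFlow {
    private final Tx tx;
    private final OrderDao dao;
    private final RemoteClient httpClient;

    PaymentFlow(Tx tx, OrderDao dao, RemoteClient httpClient) {
        this.tx = tx;
        this.dao = dao;
        this.httpClient = httpClient;
    }

    void pay() {
        // Transaction 1: persist the order, then release locks right away.
        inTransaction(dao::insert);

        // The slow remote call runs with no database transaction open.
        String result = httpClient.queryRemoteResult();

        // Transaction 2: record the outcome.
        inTransaction(() -> dao.update(result));
    }

    private void inTransaction(Runnable work) {
        tx.begin();
        try {
            work.run();
            tx.commit();
        } catch (RuntimeException e) {
            tx.rollback();
            throw e; // surface the failure instead of swallowing it
        }
    }
}
```

The trade-off is that the insert and the update are no longer atomic. If the remote call fails, the inserted row stays in its initial state, so a design like this usually needs a status column and a reconciliation job to settle such rows.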
3. CPU Saturation Analysis
During load testing, CPU remained high. Investigation revealed two main causes:
Database connection pool (C3P0) performance degradation under high concurrency.
Improper thread‑pool usage: an unbounded cached thread pool created thousands of threads.
Original code:
private static final ExecutorService executorService = Executors.newCachedThreadPool();
Replacing it with a fixed pool capped the number of threads:
private static final ExecutorService executorService = Executors.newFixedThreadPool(50);
However, newFixedThreadPool backs the pool with an unbounded queue, so tasks can still pile up and consume memory under extreme load.
Final thread‑pool strategy:
Option 1: Use a bounded queue with a reasonable pool size and move asynchronous tasks to a dedicated task processor.
Option 2: Adopt Akka for actor‑based concurrency (reference link omitted).
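The bounded-queue part of Option 1 can be sketched with the JDK's `ThreadPoolExecutor`. The class name `BoundedPools`, the sizes passed in, and the choice of `CallerRunsPolicy` as the overflow behavior are assumptions for illustration; the original does not say which rejection policy was used.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class BoundedPools {
    private BoundedPools() {}

    // Pool and queue sizes are illustrative; derive real values from load tests.
    static ThreadPoolExecutor newBoundedPool(int threads, int queueCapacity) {
        return new ThreadPoolExecutor(
                threads, threads,                         // fixed number of worker threads
                0L, TimeUnit.MILLISECONDS,                // keep-alive is unused when core == max
                new ArrayBlockingQueue<>(queueCapacity),  // bounded backlog instead of an unbounded queue
                new ThreadPoolExecutor.CallerRunsPolicy() // when full, the submitting thread runs the task
        );
    }
}
```

With `CallerRunsPolicy`, a full queue slows the submitter down rather than dropping work or growing memory, which acts as simple back-pressure. If callers must never block, `ThreadPoolExecutor.AbortPolicy` with explicit handling of `RejectedExecutionException` is the alternative.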
4. Logging Improvements
Problematic logging example:
QuataDTO quataDTO = null;
try {
    quataDTO = getRiskLimit(...);
} catch (Exception e) {
    logger.info("Failed to get risk-control limit", e); // problem: exception logged at INFO with no context
}
Best practices:
Log errors with logger.error or logger.warn.
Include system source, error description, and key information in the message.
Avoid logging only e.getMessage(); include stack trace when appropriate.
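Applied to the problematic snippet, these practices might look like the sketch below. It uses `java.util.logging` only so that it runs without dependencies; the article's own logger is a different library, and `RiskLimitLookup`, `fetchLimit`, and the `merchantId` parameter are hypothetical names.

```java
import java.util.function.Supplier;
import java.util.logging.Level;
import java.util.logging.Logger;

final class RiskLimitLookup {
    private static final Logger LOGGER = Logger.getLogger(RiskLimitLookup.class.getName());

    private RiskLimitLookup() {}

    // Returns the limit, or null after logging a WARNING that carries context and the stack trace.
    static Long fetchLimit(String merchantId, Supplier<Long> remoteCall) {
        try {
            return remoteCall.get();
        } catch (RuntimeException e) {
            // Source system, what failed, which method, and the key identifier, plus the exception itself.
            LOGGER.log(Level.WARNING,
                    "[innersys] - [risk limit lookup failed] - [fetchLimit] - merchantId:[" + merchantId + "]",
                    e);
            return null;
        }
    }
}
```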
Recommended log format:
logger.warn("[innersys] - [" + exceptionType.description + "] - [" + methodName + "] - errorCode:[" + errorCode + "], errorMsg:[" + errorMsg + "]", e);
logger.info("[innersys] - [input] - [" + methodName + "] - " + LogInfoEncryptUtil.getLogString(arguments));
logger.info("[innersys] - [result] - [" + methodName + "] - " + LogInfoEncryptUtil.getLogString(result));
Excessive logging also caused thread blocking. Changing the Log4j pattern from %d %-5p %c:%L [%t] - %m%n to %d %-5p %c [%t] - %m%n removed the %L line-number lookup, which forces Log4j to capture a stack trace for every log event. This reduced contention and improved throughput.
After these optimizations, the service achieved higher TPS, lower latency, and more stable operation.