Boosting a Payment System from 40TPS to 60TPS: Real-World Backend Performance Hacks
This article walks through the performance evolution of a real‑world payment service: the server environment, a dozen common bottlenecks (database deadlocks, long‑running transactions, CPU saturation, thread‑pool misuse, logging overload, cache problems), and the concrete code‑level optimizations, architectural changes, and monitoring tips that raised throughput and stability.
Introduction
In this post the author shares the performance evolution of a payment‑processing project he is responsible for, focusing on code‑level optimizations rather than high‑level architecture.
Server Environment
Four servers, each with 4‑core CPU and 8 GB RAM. The stack includes RabbitMQ, DB2, an internal Dubbo‑based SOA framework, Redis and Memcached for caching, and a custom configuration‑management system.
Problem Description
Single‑node capacity of 40 TPS; adding three more nodes only reaches 60 TPS, indicating poor scalability.
Frequent database deadlocks causing complete service outage.
Improper use of database transactions leading to excessively long lock times.
Regular memory‑overflow and CPU‑saturation incidents in production.
Poor fault tolerance; a tiny bug can bring the whole service down.
Missing or useless log statements that provide no diagnostic value.
Frequent reads of rarely‑changed configuration data from the database, generating heavy I/O.
Multiple WAR packages deployed in a single Tomcat, causing resource contention.
Underlying platform bugs or feature gaps reducing service availability.
No rate‑limiting on APIs, allowing VIP merchants to stress‑test the production environment.
No degradation strategy; issues lead to long recovery times or blunt rollbacks.
Lack of proper monitoring, preventing real‑time detection of bottlenecks.
Optimization Solutions
1. Database Deadlock Mitigation
The deadlock example shows two sessions waiting on each other because of mixed FOR UPDATE, gap lock, and next‑key lock usage.
Root cause: excessive pessimistic locking for idempotency checks.
Use Redis distributed locks with sharding; a single node failure is tolerable.
Implement idempotency via a primary‑key check table that returns a duplicate‑key error on repeat inserts.
Adopt version‑number based optimistic locking.
All three approaches require an expiration time to release stale locks.
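The version‑number approach can be sketched in plain Java. A minimal in‑memory stand‑in for a row with a version column, mirroring the SQL pattern `UPDATE t SET state = ?, version = version + 1 WHERE id = ? AND version = ?` (table and column names are hypothetical):

```java
import java.util.concurrent.atomic.AtomicLong;

// In-memory stand-in for a DB row with a version column. In SQL the same check
// is: UPDATE t_order SET state = ?, version = version + 1 WHERE id = ? AND version = ?
// which updates 0 rows when another writer bumped the version first.
public class OptimisticLockDemo {
    private final AtomicLong version = new AtomicLong(0);
    private volatile String state = "INIT";

    // Returns true iff our expected version still matched (rowcount == 1 in SQL terms).
    public boolean updateState(String newState, long expectedVersion) {
        if (version.compareAndSet(expectedVersion, expectedVersion + 1)) {
            state = newState;
            return true;
        }
        return false; // stale read: caller should re-read and retry, or give up
    }

    public long currentVersion() { return version.get(); }
    public String state() { return state; }

    public static void main(String[] args) {
        OptimisticLockDemo row = new OptimisticLockDemo();
        long v = row.currentVersion();                    // both writers read version 0
        boolean first = row.updateState("PAID", v);       // succeeds, version becomes 1
        boolean second = row.updateState("CANCELLED", v); // loses the race, no change
        System.out.println(first + " " + second + " " + row.state());
    }
}
```

The loser of the race gets a clean "0 rows updated" signal instead of blocking on a lock, which is why this scales better than `FOR UPDATE` for idempotency checks.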
2. Reducing Transaction Duration
Long‑running transactions often mix HTTP client calls or other blocking I/O inside the transaction scope.
Guideline: keep transactions short—extract HTTP calls out of the transactional block.
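A minimal sketch of the guideline, with hypothetical method names standing in for a real transaction manager and HTTP client: the anti‑pattern holds row locks across a slow remote call, while the fix commits first and calls out afterwards.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: txBegin/txCommit stand in for a transaction manager and
// callGateway for an HTTP client call that can take seconds.
public class TxScopeDemo {
    final List<String> trace = new ArrayList<>();

    void txBegin()     { trace.add("BEGIN"); }
    void txCommit()    { trace.add("COMMIT"); }
    void updateOrder() { trace.add("UPDATE order"); }  // takes row locks
    void callGateway() { trace.add("HTTP call"); }     // slow remote I/O

    // Bad: locks from updateOrder() are held for the whole HTTP round-trip.
    void payBad() { txBegin(); updateOrder(); callGateway(); txCommit(); }

    // Good: the transaction covers only the DB work; the remote call happens
    // after commit, and a failed call is reconciled asynchronously.
    void payGood() { txBegin(); updateOrder(); txCommit(); callGateway(); }

    public static void main(String[] args) {
        TxScopeDemo d = new TxScopeDemo();
        d.payGood();
        System.out.println(d.trace);
    }
}
```

In Spring terms this usually means moving the HTTP call out of the `@Transactional` method into the caller, or into an after‑commit hook.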
3. CPU Saturation Analysis
During load testing, CPU usage remained high. Investigation revealed that the default C3P0 connection pool and an unbounded thread pool created thousands of threads, exhausting resources.
Fixes:
Replace C3P0 with a more scalable pool.
Bound the number of worker threads (around 50 per service in this case) and avoid unbounded queues. Note that Executors.newFixedThreadPool caps threads but backs them with an unbounded LinkedBlockingQueue, so prefer constructing a ThreadPoolExecutor with a bounded queue and an explicit rejection policy.
Final thread‑pool design options are shown in the following diagrams:
Because each server has only four CPU cores, excessive threads degrade performance through context switching. The solution moves asynchronous tasks to a dedicated task processor, with a retry mechanism backed by a task table.
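A bounded‑pool sketch along these lines (sizes are illustrative, not tuned values): core equals max for a predictable thread count, a bounded queue caps the backlog, and CallerRunsPolicy applies back‑pressure on overflow instead of growing threads or queueing without limit.

```java
import java.util.concurrent.*;

// Bounded pool sized for a small 4-core box. Unlike Executors.newFixedThreadPool,
// which uses an unbounded LinkedBlockingQueue, this caps both threads AND queue.
public class BoundedPoolDemo {
    public static ThreadPoolExecutor newBoundedPool() {
        return new ThreadPoolExecutor(
                8, 8,                                        // core == max: fixed thread count
                60, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(200),               // bounded backlog
                new ThreadPoolExecutor.CallerRunsPolicy());  // overflow runs on the caller
    }

    public static void main(String[] args) throws Exception {
        ThreadPoolExecutor pool = newBoundedPool();
        Future<Integer> f = pool.submit(() -> 21 * 2);
        System.out.println(f.get()); // prints 42
        pool.shutdown();
    }
}
```

CallerRunsPolicy is a deliberate choice here: when the queue fills, the submitting thread does the work itself, which naturally slows producers down instead of dropping tasks.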
4. Logging Improvements
Current logging mixes logger.error and logger.warn with noisy, low‑value messages, causing disk I/O pressure and thread blocking.
Recommended format:
[System] Error description [KeyInfo] – include cause and effect, and optionally input/output parameters.

After reconfiguring Log4j 1.2.14, thread blocking due to logging dropped dramatically, as shown by the before/after charts.
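The article does not show the Log4j settings used; a minimal log4j.xml fragment in this spirit (file path, buffer size, and rotation limits are assumptions) routes logging through an AsyncAppender so request threads do not block on disk I/O:

```xml
<appender name="FILE" class="org.apache.log4j.RollingFileAppender">
  <param name="File" value="/var/log/pay/pay.log"/>
  <param name="MaxFileSize" value="100MB"/>
  <param name="MaxBackupIndex" value="10"/>
  <layout class="org.apache.log4j.PatternLayout">
    <param name="ConversionPattern" value="%d{ISO8601} [%t] %-5p %c - %m%n"/>
  </layout>
</appender>

<!-- Decouple request threads from disk I/O; drop on overflow rather than block. -->
<appender name="ASYNC" class="org.apache.log4j.AsyncAppender">
  <param name="BufferSize" value="8192"/>
  <param name="Blocking" value="false"/>
  <appender-ref ref="FILE"/>
</appender>

<root>
  <priority value="warn"/>
  <appender-ref ref="ASYNC"/>
</root>
```

Raising the root level to `warn` also cuts the noisy, low‑value messages the article complains about.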
5. Cache Optimization
Three typical cache problems are identified:
Cache penetration – queries for non‑existent keys repeatedly hit the DB.
Cache concurrency – many threads query DB simultaneously when a cache entry expires.
Cache avalanche – many keys expire at the same moment, flooding the DB.
Solutions:
Store a placeholder (e.g., "&&") for missing keys to prevent DB hits.
Apply a lock around cache‑miss handling so only one thread populates the cache.
Randomize cache TTL (add 1‑5 minutes) to spread expirations.
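The three fixes above can be combined in one read path. A sketch with a ConcurrentHashMap standing in for Redis: a "&&" placeholder caches DB misses, a per‑key lock lets only one thread rebuild an expired entry, and the TTL gets 1–5 minutes of random jitter.

```java
import java.util.concurrent.*;
import java.util.function.Function;

// In-memory stand-in for a Redis-backed cache demonstrating penetration,
// concurrency, and avalanche protection in a single get() path.
public class CacheGuardDemo {
    static final String MISS = "&&";              // placeholder for keys absent in the DB
    static final long BASE_TTL_MS = 30 * 60_000L; // 30-minute base TTL

    static class Entry {
        final String v; final long expireAt;
        Entry(String v, long expireAt) { this.v = v; this.expireAt = expireAt; }
    }

    final ConcurrentHashMap<String, Entry> cache = new ConcurrentHashMap<>();
    final ConcurrentHashMap<String, Object> keyLocks = new ConcurrentHashMap<>();

    public String get(String key, Function<String, String> dbLoader) {
        Entry e = cache.get(key);
        if (e != null && e.expireAt > System.currentTimeMillis()) {
            return MISS.equals(e.v) ? null : e.v;     // cached placeholder -> "not found"
        }
        Object lock = keyLocks.computeIfAbsent(key, k -> new Object());
        synchronized (lock) {                          // one DB load per key at a time
            e = cache.get(key);                        // re-check after acquiring the lock
            if (e != null && e.expireAt > System.currentTimeMillis()) {
                return MISS.equals(e.v) ? null : e.v;
            }
            String v = dbLoader.apply(key);
            long jitter = ThreadLocalRandom.current().nextLong(60_000L, 300_000L); // 1-5 min
            cache.put(key, new Entry(v == null ? MISS : v,
                    System.currentTimeMillis() + BASE_TTL_MS + jitter));
            return v;
        }
    }
}
```

A real Redis version would use `SETNX` for the rebuild lock and put the jitter into the key's `EXPIRE`, but the control flow is the same.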
6. Fault‑Tolerance Enhancements
Illustrates that swallowing DAO exceptions in the service layer does not constitute fault tolerance.
Proposes a hybrid cache strategy: critical data (e.g., payment limits) are always fetched from Redis, with a local fallback cache for resilience; less‑critical data can rely on asynchronous sync via MQ or Zookeeper.
7. Incomplete Project Splitting
Deploying multiple WARs in a single Tomcat creates resource contention; the fix is to isolate each WAR in its own Tomcat instance.
8. Platform Component Limitations
Wrapping Dubbo calls in a Future to enforce timeouts suggests the framework's built‑in timeout was ineffective; the workaround does function, but it adds a thread hop of overhead to every call.
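A sketch of that workaround, with `slowRpc` as a stand‑in for the Dubbo invocation: the deadline is enforced client‑side via `Future.get(timeout)`, and a timed‑out task is cancelled so the pool thread is freed.

```java
import java.util.concurrent.*;

// Client-side timeout around a remote call that may hang. slowRpc simulates
// the RPC; in the real system this would be the Dubbo stub invocation.
public class FutureTimeoutDemo {
    static final ExecutorService POOL = Executors.newFixedThreadPool(4);

    static String slowRpc(long costMs) throws InterruptedException {
        Thread.sleep(costMs);   // simulated network + server time
        return "OK";
    }

    public static String callWithTimeout(long costMs, long timeoutMs) {
        Future<String> f = POOL.submit(() -> slowRpc(costMs));
        try {
            return f.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            f.cancel(true);     // interrupt the hung call; frees the pool thread
            return "TIMEOUT";
        } catch (Exception e) {
            return "ERROR";
        }
    }

    public static void main(String[] args) {
        System.out.println(callWithTimeout(10, 500));   // completes in time
        System.out.println(callWithTimeout(500, 50));   // deadline exceeded
        POOL.shutdown();
    }
}
```

The extra submit/get pair per call is exactly the overhead the article refers to.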
9. Quick Bottleneck Identification
Combining top to find high‑CPU processes and pstack to inspect thread stacks quickly isolates slow threads. Example output shows thread LWP 30222 consuming 31.4 ms.
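For a Java service the same drill works with jstack instead of pstack. A sketch using the article's LWP 30222 (the process PID 30217 here is an assumption for illustration):

```shell
# Show per-thread CPU usage for the process; the ID column is the LWP
top -H -p 30217

# jstack prints native thread ids in hex (nid=0x...), so convert the hot LWP
printf '%x\n' 30222        # -> 760e

# Grep that nid in a thread dump to see exactly what the hot thread is doing
jstack 30217 | grep -A 20 'nid=0x760e'
```

This top → hex → thread‑dump loop usually isolates a spinning or blocked thread in under a minute.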
10. Index Optimization Tips
Follow the left‑most principle for composite indexes.
Avoid excessive indexes; distinguish between clustered and secondary indexes.
Be careful with nullable columns: some engines exclude NULL entries from indexes or cannot use the index for certain predicates on them, so prefer NOT NULL with a default for indexed columns.
MySQL generally uses only one index per table in a query (index merge aside); avoid ORDER BY on non‑indexed columns, which forces a filesort.
Applicable operators: >=, BETWEEN, IN, LIKE (without leading %).
Non‑applicable operators: NOT IN, LIKE with leading %.
Prefer numeric columns over strings for indexing to save space and improve I/O.
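The left‑most principle from the first tip can be made concrete (table and column names here are hypothetical): a composite index serves queries that constrain its leading column(s), but not queries that skip them.

```sql
-- Composite index on (merchant_id, trade_time)
CREATE INDEX idx_merchant_time ON t_payment (merchant_id, trade_time);

SELECT * FROM t_payment WHERE merchant_id = 42;                 -- uses the index
SELECT * FROM t_payment
 WHERE merchant_id = 42 AND trade_time >= '2016-01-01';         -- uses the index
SELECT * FROM t_payment WHERE trade_time >= '2016-01-01';       -- cannot use it
```

The third query skips the leading `merchant_id` column, so the optimizer falls back to a full scan unless a separate index on `trade_time` exists.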
11. Redis Usage Recommendations
Set expiration times for keys to prevent memory exhaustion.
Keep key names short and values simple; store objects as JSON or Protobuf.
Always return connections to the pool after use.
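The connection‑return discipline can be enforced structurally with try‑with‑resources. A pure‑Java stand‑in for a Redis client pool such as JedisPool (the `Conn` class is a dummy resource): `close()` returns the connection on every exit path, including exceptions.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Toy pool demonstrating the habit: acquire in try-with-resources so the
// connection always goes back to the pool, never leaks on an exception path.
public class PoolReturnDemo {
    static class Conn implements AutoCloseable {
        private final BlockingQueue<Conn> home;
        Conn(BlockingQueue<Conn> home) { this.home = home; }
        String get(String key) { return "value-of-" + key; } // pretend round-trip
        @Override public void close() { home.offer(this); }  // return, don't destroy
    }

    final BlockingQueue<Conn> pool = new ArrayBlockingQueue<>(8);

    public PoolReturnDemo(int size) {
        for (int i = 0; i < size; i++) pool.offer(new Conn(pool));
    }

    public Conn borrow() throws InterruptedException { return pool.take(); }
    public int idle() { return pool.size(); }

    public static void main(String[] args) throws Exception {
        PoolReturnDemo p = new PoolReturnDemo(2);
        try (Conn c = p.borrow()) {               // returned automatically by close()
            System.out.println(c.get("merchant:42:limit"));
        }
        System.out.println(p.idle());             // back to 2
    }
}
```

With Jedis the same shape is `try (Jedis j = jedisPool.getResource()) { ... }`, which returns the connection to the pool on close.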
Conclusion
The systematic analysis and targeted code‑level fixes—ranging from database locking strategies and transaction scoping to thread‑pool sizing, logging hygiene, cache design, and monitoring—raised the service’s throughput, reduced latency, and improved overall stability. The author hints that the next article will cover degradation, rate‑limiting, and monitoring solutions.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
