System Performance Issue Analysis Process and Optimization Practices
This article outlines a comprehensive process for diagnosing and optimizing business system performance problems, covering analysis workflows, influencing factors such as hardware, software, database and middleware, JVM tuning, code inefficiencies, and the use of monitoring and APM tools to improve system reliability.
Today we discuss the analysis, diagnosis, and optimization of performance problems in production business systems, focusing on issues that arise after a system goes live.
System Performance Issue Analysis Process
If a system shows no performance issues before launch but encounters serious problems afterward, the root causes usually fall into three categories: high concurrent access, growing database size, and changes in critical environment factors such as network bandwidth.
When a performance problem is discovered, the first step is to determine whether it occurs under single‑user (non‑concurrent) conditions or only under concurrent load. Single‑user issues are typically easier to test and resolve, while concurrent issues require stress testing in a controlled environment.
Single‑user problems often stem from inefficient code or SQL statements, whereas concurrent problems usually require deeper analysis of database and middleware states, possibly involving performance tuning of the middleware.
During stress testing, it is essential to monitor CPU, memory, and JVM metrics to detect conditions such as memory leaks that can cause performance degradation under load.
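Beyond external monitors, the JVM itself can be sampled during a stress run. The following is a minimal sketch (the class name `HeapSampler` is hypothetical) that measures the live heap after a suggested collection; if this value climbs steadily across sampling rounds under constant load, a memory leak is a likely cause.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class HeapSampler {

    // Returns heap bytes in use after a suggested GC. A steadily rising
    // value across rounds under constant load hints at a memory leak.
    static long usedHeapAfterGc() {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        mem.gc(); // suggest a collection so we approximate the live set
        MemoryUsage heap = mem.getHeapMemoryUsage();
        return heap.getUsed();
    }

    public static void main(String[] args) {
        System.out.println("live heap bytes: " + usedHeapAfterGc());
    }
}
```

In practice this sampling would be scheduled periodically during the stress test and plotted over time, alongside GC logs.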
Performance Issue Influencing Factors
Performance problems can be attributed to three main aspects: hardware environment, software runtime environment, and the application itself.
The hardware environment includes compute, storage, and network resources. While vendors publish TPMC values for CPU capability, real-world performance can vary considerably, and storage I/O is often the true bottleneck; slow I/O can in turn drive up memory pressure through page caching and swapping.
Common Linux monitoring tools (iostat, ps, sar, top, vmstat, etc.) help observe CPU, memory, JVM, and disk I/O to pinpoint the true source of a problem.
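Complementing these OS-level tools, the JVM's standard management API exposes a few of the same metrics from inside the process, which helps correlate OS readings with a specific JVM. A minimal sketch:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class OsSnapshot {
    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        System.out.println("arch: " + os.getArch());
        System.out.println("cpus: " + os.getAvailableProcessors());
        // 1-minute load average, comparable to the first field of `uptime`;
        // returns -1.0 on platforms that do not report it
        System.out.println("load: " + os.getSystemLoadAverage());
    }
}
```

Cross-checking this in-process view against iostat/vmstat output helps distinguish a JVM-local problem from host-wide contention.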
Runtime Environment – Database and Middleware
Database performance tuning (using Oracle as an example) involves optimizing disk I/O, rollback segments, redo logs, the System Global Area (SGA), and database objects. Session-level tracing and timed statistics can be enabled with:
ALTER SESSION SET SQL_TRACE = TRUE;
ALTER SESSION SET TIMED_STATISTICS = TRUE;
-- Run utlbstat.sql at the start of a normal-activity window and utlestat.sql to stop collection.
-- Results are written to report.txt.
Continuous database performance monitoring includes inspecting high-memory-usage alerts, excessive redo generation, and inefficient SQL statements.
Application middleware tuning focuses on container configuration (WebLogic, Tomcat, etc.) and JVM parameters. Key JVM options include:
-Xmx // maximum heap size
-Xms // initial heap size
-XX:MaxNewSize // maximum young generation size
-XX:NewSize // initial young generation size
-XX:MaxPermSize // maximum permanent generation (pre‑Metaspace)
-XX:PermSize // initial permanent generation (pre‑Metaspace)
-Xss // thread stack size
A common sizing guideline is to set -Xmx/-Xms to 3-4 times the old-generation live set measured after a Full GC, and to size the young and permanent generations proportionally.
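That rule of thumb can be expressed as a small calculation. The sketch below (class name, the 4x multiplier choice, and the roughly 3/8 young-generation ratio are illustrative assumptions, not fixed rules) derives candidate flags from a measured live set:

```java
// Sketch: derive heap flags from the old-generation live set measured
// after a Full GC, following the 3-4x rule of thumb described above.
public class HeapSizing {

    static String suggestFlags(long liveSetMb, int multiplier) {
        long heapMb = liveSetMb * multiplier; // total heap = 3-4x live set
        long youngMb = heapMb * 3 / 8;        // assumption: ~3/8 of heap for young gen
        return String.format("-Xms%dm -Xmx%dm -XX:NewSize=%dm -XX:MaxNewSize=%dm",
                heapMb, heapMb, youngMb, youngMb);
    }

    public static void main(String[] args) {
        // assumption: 512 MB live after a Full GC, 4x multiplier
        System.out.println(suggestFlags(512, 4));
        // -> -Xms2048m -Xmx2048m -XX:NewSize=768m -XX:MaxNewSize=768m
    }
}
```

Setting -Xms equal to -Xmx avoids heap resizing pauses; any derived values should still be validated under a realistic stress test.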
Software Code Performance Issues
Often, performance bottlenecks are not caused by hardware limits but by inefficient code, such as creating large objects inside loops, failing to release resources, missing caching strategies, long‑running transactions, or using sub‑optimal data structures and algorithms.
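One of the most common examples of such code-level inefficiency is string concatenation inside a loop, which allocates a new string on every iteration. A minimal before-and-after sketch (method names are illustrative):

```java
public class LoopConcat {

    // Anti-pattern: each += copies the accumulated string, O(n^2) overall.
    static String slowJoin(String[] parts) {
        String s = "";
        for (String p : parts) s += p;
        return s;
    }

    // Fix: one reusable buffer, amortized O(n) and far fewer allocations.
    static String fastJoin(String[] parts) {
        StringBuilder sb = new StringBuilder();
        for (String p : parts) sb.append(p);
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] parts = {"a", "b", "c"};
        // Both produce the same result; only the allocation behavior differs.
        System.out.println(slowJoin(parts).equals(fastJoin(parts))); // true
    }
}
```

The same pattern generalizes to the other issues listed: reuse buffers and connections, release resources in finally blocks or try-with-resources, and keep transactions short.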
These issues are best discovered through static code analysis tools and thorough code reviews, and should be codified into development standards to prevent recurrence.
Extended Considerations
Pre‑production performance testing may fail to replicate real production conditions due to differences in hardware, data volume, and concurrency, which explains why some issues only surface after launch.
Even with horizontal scaling of databases (e.g., Oracle RAC) and application clusters, performance problems can persist if single‑node performance is poor.
Performance diagnosis can be classified statically into operating‑system/storage, middleware (database and application servers), and software layers (SQL, business logic, front‑end). Dynamically, tracing a request through code and infrastructure helps locate the exact bottleneck.
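The dynamic, per-request view can be approximated even without an APM product by timing each named stage of a request. A minimal sketch (the `StageTimer` class and the stage names are hypothetical):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class StageTimer {
    private final Map<String, Long> elapsedNanos = new LinkedHashMap<>();

    // Times one named stage of a request and accumulates its duration.
    void time(String stage, Runnable work) {
        long t0 = System.nanoTime();
        work.run();
        elapsedNanos.merge(stage, System.nanoTime() - t0, Long::sum);
    }

    // The stage that consumed the most time is the first bottleneck candidate.
    String slowestStage() {
        return elapsedNanos.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse("none");
    }

    public static void main(String[] args) {
        StageTimer t = new StageTimer();
        t.time("controller", () -> {});
        t.time("sql", () -> {
            try { Thread.sleep(20); } catch (InterruptedException e) { }
        });
        t.time("render", () -> {});
        System.out.println("bottleneck: " + t.slowestStage());
    }
}
```

Real APM agents do essentially this transparently, via bytecode instrumentation, and additionally propagate a trace ID across service and database boundaries.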
Monitoring and APM
Modern APM tools provide end‑to‑end visibility, correlating resource usage (CPU, memory) with specific services, SQL statements, and business functions, enabling rapid identification of performance hotspots.
Integrating APM with DevOps practices allows proactive detection and automated analysis of performance degradation, greatly improving troubleshooting efficiency.