When Ops Gets Blamed: Three Real‑World Troubleshooting Cases and What They Teach
The article shares three detailed production incidents—an Oracle overload, a blockchain media service overload, and a 502 failure in an Nginx‑Tomcat rental app—explaining how the author diagnosed, mitigated, and permanently resolved each problem, and distilling key operational lessons.
In this post the author, an IT operations lead, recounts three real‑world incidents that illustrate how proactive troubleshooting can shift blame away from ops and improve system reliability.
Case 1: Oracle overload behind a payment gateway
A payment system used a front‑end load balancer, Tomcat, Memcached and an Oracle 11g RAC cluster. Rapid business growth outpaced staffing, leading to frequent performance issues. When Oracle connections stalled at around 300, the author temporarily stopped Tomcat (killall -9 java), observed a sharp drop in Oracle load, then restarted Tomcat, causing load to spike above 600.
Further investigation revealed a design flaw in the back-office admin portal: it queried every agent together with all of that agent's downstream users, and a statistics query ran SELECT COUNT(*) against large tables, forcing massive data scans. The portal also always fetched detailed transaction records for every user, which multiplied the workload when dozens of agents were working at the same time.
After identifying the problematic SQL statements via a performance probe, developers corrected the code, eliminating the overload.
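The kind of change described above can be sketched as follows. This is a minimal illustration, not the team's actual fix: it uses an in-memory SQLite database as a stand-in for Oracle, and the transactions table, its columns, and the function names are all hypothetical. The idea is to return one aggregated summary per agent and to load detail rows only for the user being viewed, with a cap on the row count.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database (hypothetical schema, SQLite stand-in)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE transactions (
            id       INTEGER PRIMARY KEY,
            agent_id INTEGER NOT NULL,
            user_id  INTEGER NOT NULL,
            amount   INTEGER NOT NULL
        );
        CREATE INDEX idx_tx_agent_user ON transactions (agent_id, user_id);
        """
    )
    # Two agents, three users each, four transactions per user.
    rows = [
        (agent, user, 10 * user)
        for agent in (1, 2)
        for user in (1, 2, 3)
        for _ in range(4)
    ]
    conn.executemany(
        "INSERT INTO transactions (agent_id, user_id, amount) VALUES (?, ?, ?)",
        rows,
    )
    return conn


def agent_summary(conn, agent_id):
    """One aggregated query per agent instead of a detail fetch per user."""
    return conn.execute(
        "SELECT user_id, COUNT(*), SUM(amount) FROM transactions "
        "WHERE agent_id = ? GROUP BY user_id ORDER BY user_id",
        (agent_id,),
    ).fetchall()


def user_details(conn, agent_id, user_id, limit=20):
    """Detail rows are fetched only for the user being viewed, and capped."""
    return conn.execute(
        "SELECT id, amount FROM transactions "
        "WHERE agent_id = ? AND user_id = ? ORDER BY id DESC LIMIT ?",
        (agent_id, user_id, limit),
    ).fetchall()
```

With this shape, the portal's landing page costs one indexed query per agent, and the expensive detail query runs only when someone actually opens a user's record.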
Case 2: Blockchain media project overload
A newly launched blockchain‑based media service deployed high‑spec cloud VMs and a load‑balancer. Even without promotion, the cluster’s load approached 1000, threatening crashes.
The author checked php-fpm.conf and tuned its parameters.
Running SHOW FULL PROCESSLIST in the mysql client showed a large number of connections.
A review of the web access logs showed that the “News” section fetched all records in a single request, causing massive database reads.
The team confirmed that the app retrieved the entire dataset on every page view, whether or not it was needed. After talking with the developers, they capped the Nginx request rate at 5 requests per second. This reduced the load but made the app unusable, which confirmed that the app was issuing dozens of DB queries per second, an unreasonable pattern. The developers then re‑engineered the data‑fetching logic, which resolved the issue.
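The article does not say how the data-fetching logic was rewritten. One common way to stop a listing page from reading the whole table is keyset pagination, sketched below under stated assumptions: SQLite stands in for MySQL, and the news table, its columns, and the function names are hypothetical.

```python
import sqlite3


def make_news_db(n_articles):
    """Create an in-memory table of articles (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE news (id INTEGER PRIMARY KEY, title TEXT NOT NULL)")
    conn.executemany(
        "INSERT INTO news (id, title) VALUES (?, ?)",
        [(i, f"Article {i}") for i in range(1, n_articles + 1)],
    )
    return conn


def fetch_news_page(conn, page_size=10, before_id=None):
    """Return one page of the newest articles plus a cursor for the next page.

    Only page_size rows are read per request, instead of the whole table.
    """
    if before_id is None:
        sql = "SELECT id, title FROM news ORDER BY id DESC LIMIT ?"
        params = (page_size,)
    else:
        sql = "SELECT id, title FROM news WHERE id < ? ORDER BY id DESC LIMIT ?"
        params = (before_id, page_size)
    rows = conn.execute(sql, params).fetchall()
    # A short page means the end was reached, so there is no next cursor.
    next_cursor = rows[-1][0] if len(rows) == page_size else None
    return rows, next_cursor
```

The caller passes the returned cursor back as before_id to get the following page. Because each query filters on the primary key, the cost per request stays roughly constant as the table grows.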
Case 3: 502 errors in an Nginx + Tomcat rental service
The service ran two Tomcat instances behind Nginx. Earlier mitigations (memory limits, running Tomcat as the www user, a daily cron restart, and URL monitoring) had kept it stable, but a new deployment caused repeated 502 errors.
After a manual restart the Tomcat process was running, but catalina.out logged exceptions.
Investigation of the webapps directory revealed two extra directories besides ROOT. Removing them and restarting Tomcat restored service.
Further inspection showed that the two extra projects pointed to different databases, so the wrong configuration was being picked up at startup.
Correcting the configuration and cleaning up the deployment eliminated the 502 failures.
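With default settings, Tomcat deploys every directory and WAR file it finds under webapps as a separate application, so stray directories like the two found here are started alongside ROOT. A pre-deployment check along the following lines could flag them before a restart. This is a sketch only: the function name and the allow-list default are assumptions, not part of the original setup.

```python
from pathlib import Path


def unexpected_webapps(webapps_dir, expected=("ROOT",)):
    """List entries under webapps_dir that are not in the expected app list.

    A deployed app is either an exploded directory or a .war archive,
    so ROOT and ROOT.war both count as the ROOT application.
    """
    allowed = set(expected)
    extras = []
    for entry in Path(webapps_dir).iterdir():
        app_name = entry.stem if entry.suffix == ".war" else entry.name
        if app_name not in allowed:
            extras.append(entry.name)
    return sorted(extras)
```

A deployment script could call this on the webapps path and refuse to restart Tomcat when the returned list is not empty, leaving the cleanup decision to a person.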
Key Takeaways
Effective operations require deep knowledge of both the underlying systems and the business logic they support. Understanding how applications query data, recognizing design flaws, and communicating with developers enable faster root‑cause analysis and prevent ops from being the scapegoat.
dbaplus Community
