How a Government System’s Week‑Long Outage Exposed Critical Backend and Load‑Balancing Flaws

A government information system suffered a week of instability, including service deadlocks, Tomcat out‑of‑memory crashes, and load‑balancing failures. A forensic analysis traced the problems to database table locks, a front‑end request loop, inadequate monitoring, and misconfigured logging, and produced concrete remediation steps and lessons for future reliability.

dbaplus Community

Background

The author’s company builds government information systems. After a weekend upgrade of a core citizen‑facing application, the system experienced a week of instability—frequent crashes, service deadlocks, database table locks, and prolonged outages that severely impacted the agency’s reputation.

System Architecture

The pre‑incident architecture was a traditional Spring MVC + MyBatis stack running on Windows Server 2008, without a distributed framework. Load balancing was implemented with a simple hardware device using IP‑hash routing, and monitoring only checked port reachability.

Incident Timeline

Three major problems were identified:

Service deadlock and database lock‑up

Tomcat memory‑overflow crashes

Load‑balancing inefficiencies leading to full system collapse

1. Service Deadlock

On September 21 the system became unresponsive. Investigation showed CPU spikes on the application server, high database CPU/IO, and a massive UPDATE operation on a table containing over 130 million rows. The operation performed table‑structure changes, index rebuilds, and data updates, locking the entire table and causing all services to hang. Restarting the database server restored service after about 30 minutes.

Mitigation steps:

Implemented table partitioning by time to avoid single‑table locks.

Restricted the “update instance” function to administrators only.

Enhanced on‑site staff training to prohibit risky operations during live deployments.
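The time‑based partitioning in the first mitigation step can be sketched as a small routing helper. This is a minimal illustration, not the team's actual implementation: the monthly granularity, the class name `MonthlyPartitions`, and the table prefix used in the usage note are all assumptions, since the source does not name the table or the database.

```java
import java.time.LocalDate;
import java.time.YearMonth;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: the naming scheme and monthly granularity are assumptions.
final class MonthlyPartitions {
    private final String tablePrefix;

    MonthlyPartitions(String tablePrefix) {
        this.tablePrefix = tablePrefix;
    }

    // Maps a record date to the partition that would hold it, e.g. prefix_2019_09.
    String partitionFor(LocalDate date) {
        return String.format(Locale.ROOT, "%s_%04d_%02d",
                tablePrefix, date.getYear(), date.getMonthValue());
    }

    // Lists every partition touched by an inclusive date range, oldest first,
    // so a bulk job can process (and lock) one partition at a time.
    List<String> partitionsBetween(LocalDate from, LocalDate to) {
        if (to.isBefore(from)) {
            throw new IllegalArgumentException("'to' must not be before 'from'");
        }
        List<String> result = new ArrayList<>();
        YearMonth cursor = YearMonth.from(from);
        YearMonth last = YearMonth.from(to);
        while (!cursor.isAfter(last)) {
            result.add(partitionFor(cursor.atDay(1)));
            cursor = cursor.plusMonths(1);
        }
        return result;
    }
}
```

With a hypothetical prefix such as `instance_log`, a maintenance job would call `partitionsBetween` and work through the returned names one by one, which keeps each lock scoped to a single month of data rather than the whole 130‑million‑row table.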

2. Tomcat Memory‑Overflow

Later that day, two core services crashed intermittently with out‑of‑memory (OOM) errors. The root causes differed:

Infinite loop: Faulty front‑end logic produced a request cycle (A→B→A…) that caused unbounded memory growth.

Fuzzy query: A fuzzy‑match query loaded more than 2 million rows into memory and triggered the same OOM failure; the fix limited results to 1,000 rows and moved heavy calculations to the database.
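A request cycle of the A→B→A kind can also be caught defensively on the server side. The sketch below is one possible guard, not something the source describes: the class name `RequestChainGuard`, the hop limit, and the idea of carrying the visited path along with the request are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical guard: rejects a hop that revisits a step or exceeds a hop limit.
final class RequestChainGuard {
    private final int maxHops;

    RequestChainGuard(int maxHops) {
        if (maxHops <= 0) {
            throw new IllegalArgumentException("maxHops must be positive");
        }
        this.maxHops = maxHops;
    }

    // Returns a new path with 'next' appended, leaving the input list untouched.
    // Throws IllegalStateException when the hop would form a cycle or run too long.
    List<String> enter(List<String> pathSoFar, String next) {
        if (pathSoFar.contains(next)) {
            throw new IllegalStateException("cycle detected: " + pathSoFar + " -> " + next);
        }
        if (pathSoFar.size() >= maxHops) {
            throw new IllegalStateException("hop limit " + maxHops + " exceeded");
        }
        List<String> extended = new ArrayList<>(pathSoFar);
        extended.add(next);
        return extended;
    }
}
```

Failing fast on the second visit to A turns unbounded memory growth into a single rejected request that shows up in logs and tests.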

Remediation:

Fixed the looping code and added comprehensive test cases.

Implemented result‑size limits and refactored logic to reduce in‑memory data.
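The result‑size limit can be illustrated with a small helper that stops reading after a fixed number of rows and reports whether more were available. This is a sketch under assumptions: `BoundedResults` is a hypothetical name, and in a real MyBatis mapper the cap would also be pushed into the query so the extra rows never leave the database.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper: holds at most maxRows items and a flag for "more exist".
final class BoundedResults<T> {
    final List<T> rows;
    final boolean truncated;

    private BoundedResults(List<T> rows, boolean truncated) {
        this.rows = rows;
        this.truncated = truncated;
    }

    // Reads at most maxRows items from the source; never materializes the rest.
    static <T> BoundedResults<T> take(Iterator<T> source, int maxRows) {
        if (maxRows <= 0) {
            throw new IllegalArgumentException("maxRows must be positive");
        }
        List<T> rows = new ArrayList<>();
        while (rows.size() < maxRows && source.hasNext()) {
            rows.add(source.next());
        }
        // If the source still has items, the caller should ask the user to refine the query.
        return new BoundedResults<>(rows, source.hasNext());
    }
}
```

Returning the `truncated` flag lets the front end tell users that their fuzzy search matched too much, instead of silently showing a partial list.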

3. Load‑Balancing Issues

The hardware load balancer monitored only port availability, not service health, so when a service deadlocked the balancer kept routing traffic to it. IP‑hash routing distributed load unevenly once a server went down, and because requests pass through a chain of services (A→B→C), failures in downstream services went undetected.
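The gap between a port check and a service‑level check can be sketched as follows. A deadlocked service still accepts connections, so the probe has to run real work and treat a hang as a failure. The class name `ServiceProbe` and the timeout values are assumptions; the check passed in would be whatever exercises the service, such as a trivial database query or a call to the next service in the chain.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical probe: a check that hangs, throws, or returns false is unhealthy.
final class ServiceProbe {
    static boolean isHealthy(Callable<Boolean> check, long timeoutMillis) {
        // Daemon thread, so a permanently hung check cannot block JVM shutdown.
        ExecutorService worker = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "health-probe");
            t.setDaemon(true);
            return t;
        });
        try {
            Future<Boolean> result = worker.submit(check);
            return Boolean.TRUE.equals(result.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (Exception e) {
            // Covers TimeoutException (hung service) and ExecutionException (failed check).
            return false;
        } finally {
            worker.shutdownNow();
        }
    }
}
```

A port‑reachability check corresponds to a check that always returns true; the timeout is what lets the probe notice a service that is listening but no longer making progress.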

Solutions applied:

Added service‑level health checks (JSP endpoints returning HTTP 200 for normal operation).

Switched routing strategy from IP‑hash to round‑robin.

Adjusted program configuration to support true distributed load balancing.
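The combination of round‑robin routing and health checks can be sketched in a few lines. The incident involved a hardware balancer, so this is only an illustration of the policy, not its configuration; `RoundRobinSelector` and the backend names in the usage note are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;

// Hypothetical selector: rotates through backends and skips unhealthy ones.
final class RoundRobinSelector {
    private final List<String> backends;
    private final AtomicInteger cursor = new AtomicInteger();

    RoundRobinSelector(List<String> backends) {
        if (backends.isEmpty()) {
            throw new IllegalArgumentException("at least one backend is required");
        }
        this.backends = new ArrayList<>(backends);
    }

    // Returns the next healthy backend, or empty when every backend fails the check.
    Optional<String> next(Predicate<String> healthy) {
        for (int attempt = 0; attempt < backends.size(); attempt++) {
            // floorMod keeps the index valid even after the counter overflows.
            int index = Math.floorMod(cursor.getAndIncrement(), backends.size());
            String candidate = backends.get(index);
            if (healthy.test(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }
}
```

With backends `app1`, `app2`, and `app3`, successive calls rotate through all three; if `app2` fails its health check, the request that would have reached it goes to `app3` instead. Unlike IP‑hash, the choice does not depend on the client address, so clients are not pinned to a single server.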

System Crash on September 25

On September 25 the system became completely unresponsive for three hours. The root cause was a mistakenly enabled log4j.properties file set to DEBUG level, causing the Tomcat console to flood with logs from session‑synchronization (Shiro + Ehcache) and fingerprint‑login processing. The massive console output saturated CPU, leading to a full system hang.

Key findings:

Debug logging on production servers can trigger CPU spikes.

Session synchronization that logs large Base64‑encoded fingerprint data amplified the problem.

Virtual machines added latency and resource contention, worsening the CPU spike.
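The second finding suggests logging a bounded summary of large payloads rather than the payload itself. The sketch below is one way to do that, assuming a hypothetical `LogPayloads` helper; the summary format (length plus a short SHA‑256 prefix for correlating log lines) is an illustrative choice, not something taken from the source.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: replaces a large payload with a short, fixed-size summary.
final class LogPayloads {
    static String summarize(String payload) {
        if (payload == null) {
            return "null";
        }
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            byte[] digest = sha.digest(payload.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            // Only the first 4 digest bytes: enough to correlate entries, not to recover data.
            for (int i = 0; i < 4; i++) {
                hex.append(String.format("%02x", digest[i] & 0xff));
            }
            return "[" + payload.length() + " chars, sha256=" + hex + "...]";
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen.
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }
}
```

A call such as `log.debug("fingerprint=" + LogPayloads.summarize(encoded))` keeps each log line to a few dozen characters regardless of how large the Base64 string is.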

Resolution:

Disabled DEBUG logging in production.

Removed unnecessary session‑sync logging and limited fingerprint data handling.

Re‑engineered the environment with proper monitoring and a fallback emergency plan.
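A startup check can make an accidental DEBUG configuration harder to ship. The sketch below uses `java.util.logging` as a dependency‑free stand‑in for the log4j setup in the incident, so it illustrates the idea rather than the actual fix; the class name `ProductionLogGuard` and the `"production"` environment label are assumptions.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical guard: in production, raises an overly verbose root logger to INFO.
final class ProductionLogGuard {
    // Returns true when the level had to be corrected, so the caller can raise an alert.
    static boolean enforce(String environment) {
        if (!"production".equalsIgnoreCase(environment)) {
            return false;
        }
        Logger root = Logger.getLogger("");
        Level current = root.getLevel();
        // Levels with a lower intValue (FINE, FINER, FINEST, ALL) are more verbose than INFO.
        if (current != null && current.intValue() >= Level.INFO.intValue()) {
            return false;
        }
        root.setLevel(Level.INFO);
        return true;
    }
}
```

Calling `ProductionLogGuard.enforce(...)` once during application startup, and alerting when it returns true, surfaces a stray debug configuration before it can flood the console.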

Conclusion & Lessons Learned

The incident highlighted several systemic weaknesses: inadequate operational procedures, insufficient logging and monitoring, lack of load‑balancing health checks, and poor change‑management discipline. Stricter code reviews, realistic load‑balancing tests, comprehensive monitoring, and clear emergency runbooks are essential to prevent similar outages.
