Misconfigured Oracle Parallel Parameters Caused a System Outage – DBA Case Study
In this detailed DBA case study, the speaker explains how improper settings of Oracle's parallel execution parameters on an AIX‑based 10gR2 database led to process saturation, undo‑related wait events, and temporary connection issues, and describes the step‑by‑step diagnosis, parameter adjustments, and lessons learned for future tuning.
On November 5, Senior DBA Yao Yu presented an online session for the "DBA+ Shenzhen-Hong Kong Group" (DBA+深港群) community about a real production incident in which incorrect Oracle parallel parameter settings caused a system fault.
System Description
The environment consisted of AIX 5.3 and a single‑instance Oracle Database 10.2.0.3 supporting an Oracle Portal application used for generating reports. The database size was about 14 TB, with an SGA of 80 GB.
Fault Manifestation
During lunch, the monitoring system (SMC Service Management Center) raised an alarm indicating database connection problems, although the business users reported no impact. A health check showed that all database instances were still running; no instance crash occurred. Screenshots showed multiple idle instances on a single host.
Fault Resolution
Login attempts intermittently returned "ORA-01012: not logged on" or "Connected to an idle instance". After generating a PFILE for reference, the DBA noticed that the number of processes was close to the limit set by the processes parameter and that parallel_max_servers was set to 685, far higher than expected.
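The headroom check described above can be sketched in Python. This is a minimal illustration, not the DBA's actual procedure: the helper name `near_limit` and the 90% threshold are assumptions, and the rows are expected to come from whatever database driver is in use. The query itself reads v$resource_limit, a standard Oracle dynamic view whose LIMIT_VALUE column is a padded string that may contain UNLIMITED.

```python
# Hypothetical helper; the name and threshold are illustrative assumptions.
# v$resource_limit is a standard Oracle view; LIMIT_VALUE is a padded string.
RESOURCE_LIMIT_SQL = """
SELECT resource_name, current_utilization, limit_value
  FROM v$resource_limit
 WHERE resource_name IN ('processes', 'sessions')
"""


def near_limit(rows, threshold=0.9):
    """Return the resource names whose utilization is at or above threshold.

    rows: iterable of (resource_name, current_utilization, limit_value)
    tuples, in the shape a DB-API cursor would return for RESOURCE_LIMIT_SQL.
    """
    flagged = []
    for name, current, limit in rows:
        limit_text = str(limit).strip()
        if limit_text.upper() == "UNLIMITED":
            continue  # no finite ceiling to compare against
        limit_num = int(limit_text)
        if limit_num <= 0:
            continue  # guard against division by zero
        if int(current) / limit_num >= threshold:
            flagged.append(name)
    return flagged
```

Running the query on a schedule and alerting on a non-empty result would give earlier warning than waiting for failed logins.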
Further inspection of v$px_process revealed over 260 ora_pxx processes, many of them in the AVAILABLE state, meaning they had been spawned but were not executing any work. The undo wait event “wait for a undo record” also appeared prominently in an AWR report.
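A tally of parallel server states, as in the inspection above, might look like the following sketch. The function name `summarize_px` is an assumption made for illustration; the STATUS column of v$px_process reports either AVAILABLE or IN USE.

```python
from collections import Counter

# One row per parallel execution server known to the instance.
PX_STATUS_SQL = "SELECT status FROM v$px_process"


def summarize_px(statuses):
    """Summarize parallel server states from an iterable of STATUS strings.

    Hypothetical helper: returns the total, the idle (AVAILABLE) and busy
    (IN USE) counts, and the fraction of servers that are idle.
    """
    counts = Counter(s.strip().upper() for s in statuses)
    total = sum(counts.values())
    available = counts.get("AVAILABLE", 0)
    return {
        "total": total,
        "available": available,
        "in_use": counts.get("IN USE", 0),
        "idle_ratio": available / total if total else 0.0,
    }
```

A high idle ratio alongside a large total suggests that servers are being kept alive without doing work, which is the pattern the DBA observed.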
Since parallel_max_servers is a dynamic parameter, the DBA halved its value online and monitored the effect. After a few minutes, the number of parallel processes dropped, the OS‑level ora_pxx processes decreased, and the overall process count fell from 800 to around 600. No further alerts were raised, confirming that the adjustment had taken effect.
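The halving step can be expressed as a small helper that produces the statement to run. This is only a sketch: the function name is hypothetical, and the talk does not say which SCOPE clause was used, so none is included here. Without one, Oracle applies the change to both memory and the SPFILE if the instance was started with an SPFILE, and to memory only otherwise.

```python
def halved_parallel_max_servers(current):
    """Return (new_value, statement) for halving parallel_max_servers.

    Hypothetical helper. Integer division is used, so 685 becomes 342.
    """
    if current < 2:
        raise ValueError("value is too small to halve")
    new_value = current // 2
    statement = f"ALTER SYSTEM SET parallel_max_servers = {new_value}"
    return new_value, statement
```

For the value in this incident, `halved_parallel_max_servers(685)` yields 342 and the matching ALTER SYSTEM statement.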
Fault Analysis
The AWR report highlighted a large number of “wait for a undo record” waits, which are associated with parallel transaction recovery, that is, fast‑start parallel rollback. In this mode, the SMON background process acts as coordinator and uses multiple server processes to roll back the work of long‑running parallel DML statements.
When many parallel rollback slaves contend for the same resources, performance can degrade, causing symptoms that resemble a database hang. In this case, a long‑running parallel SQL was killed by the business team, triggering fast‑start parallel rollback and exhausting parallel server resources.
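To see how far such a rollback has progressed, Oracle exposes the v$fast_start_transactions view. The sketch below is not from the talk, and the helper name `rollback_progress` is an assumption. It turns the UNDOBLOCKSDONE and UNDOBLOCKSTOTAL columns into a completion percentage per transaction.

```python
# STATE is one of TO BE RECOVERED, RECOVERING, or RECOVERED.
FAST_START_SQL = """
SELECT usn, state, undoblocksdone, undoblockstotal
  FROM v$fast_start_transactions
"""


def rollback_progress(rows):
    """Return a list of (usn, state, percent_done) tuples.

    Hypothetical helper. rows: iterable of
    (usn, state, undoblocksdone, undoblockstotal) tuples.
    """
    progress = []
    for usn, state, done, total in rows:
        # A transaction with no undo blocks to process counts as complete.
        percent = 100.0 * done / total if total else 100.0
        progress.append((usn, state, round(percent, 1)))
    return progress
```

Sampling this every few minutes shows whether the rollback is advancing or stalled, which helps distinguish a slow recovery from a true hang.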
Oracle's documentation confirms that the FAST_START_PARALLEL_ROLLBACK parameter, which accepts FALSE, LOW, or HIGH, controls this behavior. The DBA reduced it to the low setting to prevent aggressive parallel rollback in the future.
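The effect of each setting can be illustrated with a short calculation. According to the Oracle reference, LOW limits the maximum degree of parallelism for rollback to 2 * CPU_COUNT, HIGH limits it to 4 * CPU_COUNT, and FALSE disables parallel rollback. The function name below is hypothetical, and the talk does not state the CPU count of the affected host.

```python
def max_rollback_parallelism(setting, cpu_count):
    """Return the documented cap on parallel rollback for a given setting.

    Hypothetical helper. Returns 0 when parallel rollback is disabled
    (FALSE), in which case the rollback is performed serially.
    """
    multipliers = {"FALSE": 0, "LOW": 2, "HIGH": 4}
    key = setting.strip().upper()
    if key not in multipliers:
        raise ValueError(f"unknown setting: {setting!r}")
    return multipliers[key] * cpu_count
```

On a host with many CPUs, the difference between HIGH and LOW can amount to dozens of recovery processes, all of which count against the processes limit.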
Summary
The incident, which occurred in late September, was fully resolved after about a month of monitoring, and the database has remained stable since.
Key takeaways: automatically calculated parallel parameters may not suit specific workloads; DBA‑level awareness of PARALLEL_MAX_SERVERS, FAST_START_PARALLEL_ROLLBACK, and related undo wait events is essential. Proactive tuning based on business patterns can prevent similar resource‑contention issues.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
