How to Achieve Zero‑Fault Database Operations: Real‑World Cases and Management Practices
This article shares practical experiences from a DBAplus Guangzhou tech salon, detailing three real Oracle database incident cases, the root‑cause analyses, and a three‑step framework for rapid resolution, prevention, and team management to maintain zero‑fault operations across thousands of database instances.
Based on Liu Shuan's talk at the DBAplus Guangzhou Operations Technology Salon (March 23, 2018), the author, a manager of Ping An Technology's database operations team, describes how the team managed over 10,000 database instances (Oracle, PostgreSQL, MySQL, Redis, MongoDB) and maintained two consecutive years of zero‑fault operation.
Problem Solving Philosophy
The speaker likens DBAs to physicians: rather than merely fixing symptoms (the "Bian Que" approach), they should anticipate and eliminate root causes before they manifest, akin to the "elder brother" who prevents disease.
Case 1 – Oracle 11.2.0.4 SPM Plan Stabilization
Issue: After major releases, certain SQL statements produced abnormal execution plans, often involving a SKIP SCAN on a time‑based column.
Analysis: The problem stemmed from outdated index statistics and an index with a leading column containing many nulls, causing the optimizer to choose inefficient plans.
Solution: Re‑solidify execution plans via SPM, collect fresh index statistics, and finally disable or drop the problematic index. The team also discovered that SPM is fragile when statements change slightly during releases.
Case 2 – Oracle 12.1.0.2 Night‑time CPU Spike
Issue: At night the CPU reached 100 % and a massive number of connections were created.
Root Cause: A simple SELECT USER FROM SYS.DUAL executed by a login trigger, triggered repeatedly during automatic statistics collection.
Resolution: Locked the statistics collection on the DUAL table, preventing the trigger from repeatedly firing and stabilizing CPU usage.
Case 3 – Oracle 10.2 Upgrade Causing Cursor:pin Waits
Issue: After upgrading from 10.2.0.5.X to 10.2.0.5.18, frequent cursor:pin waits appeared, especially during high‑frequency queries that called a function millions of times.
Analysis: Function calls were amplified by full‑table scans on large tables; time‑based indexes suffered from stale statistics and poor selectivity.
Mitigation: Created a PCT FREE 99 index covering all columns to spread rows across many index blocks, reducing hot‑block contention. Later the index was removed when it interfered with a month‑end reporting job.
Operational Management Practices
The speaker outlines three pillars:
Change Management: Strict daytime change freeze, mandatory approvals, and thorough pre‑implementation testing.
Standardization: Define architecture and operational standards (e.g., co‑locating Redis with applications to avoid firewall overload) and continuously refine them based on real‑world incidents.
Team Development: Rotate responsibilities, document recurring issues in a knowledge base, conduct regular skill‑sharing sessions, and ensure clear ownership to avoid knowledge silos.
Overall, the talk emphasizes proactive design (preventive indexing, partitioning by time), systematic incident analysis, and a culture of continuous improvement to sustain high‑availability database services.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
