How to Tackle Oracle Bad Blocks: Practical Strategies for DBAs
The article explains why Oracle bad‑block incidents demand scenario‑driven handling: it outlines common interview questions, describes the pitfalls of inspecting each database's alert log individually, and advocates switching promptly to the disaster‑recovery environment and restoring service from backups, illustrated with real‑world cases and practical DBA advice.
Recent discussions at the Gdevops Global Agile Operations Summit emphasized that technology must be driven by business scenarios; otherwise, platform construction is meaningless. For DBAs, this means aligning maintenance techniques with real‑world operational needs.
Oracle bad block problems are common for DBAs with two or three years of experience. Interviewers often ask: Do you understand Oracle bad blocks? Why do they occur? Describe a case you handled. If multiple databases suddenly show many bad blocks, what would you do?
Many candidates answer the first questions well but falter on the last one because they focus on checking each database’s alert log, which is inefficient when dozens or hundreds of instances are involved.
The correct approach is scenario‑driven: when the number of bad blocks is large, immediately stop the affected services, switch to the disaster‑recovery (DR) environment, and restore data later. As the saying goes, “You nurture a database for years; you use it in an emergency.”
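Rather than reading every alert log by hand, the first pass across a fleet can be scripted. Below is a minimal Python sketch, assuming a python-oracledb monitoring account and an illustrative instance list; it counts rows in V$DATABASE_BLOCK_CORRUPTION, which RMAN populates during BACKUP or VALIDATE runs, to flag which databases need attention (and possibly a DR switch) first.

```python
# Minimal sketch: fleet-wide triage of reported corrupt blocks, instead of
# logging in to each instance and reading its alert log by hand.
import oracledb  # python-oracledb; thin mode needs no Oracle client

# Illustrative fleet inventory: (name, Easy Connect DSN) pairs -- assumptions
FLEET = [
    ("orcl1", "dbhost1:1521/orcl1"),
    ("orcl2", "dbhost2:1521/orcl2"),
]

MONITOR_USER = "monitor"        # assumed read-only monitoring account
MONITOR_PASSWORD = "change_me"  # placeholder credential

def corrupt_block_count(dsn: str) -> int:
    """Rows in V$DATABASE_BLOCK_CORRUPTION (populated by RMAN BACKUP/VALIDATE)."""
    with oracledb.connect(user=MONITOR_USER, password=MONITOR_PASSWORD, dsn=dsn) as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT COUNT(*) FROM v$database_block_corruption")
            return cur.fetchone()[0]

if __name__ == "__main__":
    for name, dsn in FLEET:
        try:
            count = corrupt_block_count(dsn)
            flag = "  <-- investigate, consider DR switch" if count else ""
            print(f"{name}: {count} corrupt blocks reported{flag}")
        except oracledb.Error as exc:
            print(f"{name}: check failed ({exc})")
```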
When should you trigger DR? Generally, if a fault is expected to keep the business down for more than two hours, you should declare a DR switch. In the financial sector the thresholds are much tighter: as little as one minute for securities, and half an hour for broader reporting requirements.
Effective DR requires prior planning: build a reliable DR environment, create a switch‑over plan, and test it regularly. Rapid detection and reporting are crucial; an automated operations platform can provide a button‑click view of which objects are affected by bad blocks, greatly speeding up response.
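The "which objects are affected" view such a platform exposes is, underneath, a join between V$DATABASE_BLOCK_CORRUPTION and DBA_EXTENTS. A hedged sketch of that query, with illustrative connection details, follows:

```python
# Sketch: map reported corrupt blocks to the owning segments -- the same view
# an automated operations platform would surface with one click.
import oracledb

AFFECTED_OBJECTS_SQL = """
SELECT e.owner, e.segment_name, e.segment_type,
       c.file#, c.block#, c.blocks
  FROM v$database_block_corruption c
  JOIN dba_extents e
    ON e.file_id = c.file#
   AND c.block# BETWEEN e.block_id AND e.block_id + e.blocks - 1
"""

def affected_objects(dsn: str, user: str, password: str):
    """Return (owner, segment, type, file#, block#, blocks) for each reported corruption."""
    with oracledb.connect(user=user, password=password, dsn=dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(AFFECTED_OBJECTS_SQL)
            return cur.fetchall()

if __name__ == "__main__":
    # Connection details below are placeholders, not real endpoints.
    for row in affected_objects("dbhost1:1521/orcl1", "monitor", "change_me"):
        print(row)
```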
From an interview perspective, the key is to ask why multiple databases experience bad blocks simultaneously. The root cause is often external, such as a storage management software bug. In many cases, the issue lies in the storage layer rather than the database itself.
Example 1: A bug in Storage Foundation's volume replication caused widespread bad blocks across databases from Oracle, IBM, and other vendors; the culprit was identified only after a month of investigation.
Example 2: After a storage software state recovery, bad blocks persisted and had to be repaired manually with fsck on each affected system.
Another recent incident involved an unexpected termination of the LGWR process, which corrupted numerous datafiles and left the database unable to open; startup attempts raised internal errors. The fastest resolution was to restore from the latest backup and roll forward with archived logs, rather than attempting to repair each bad block individually.
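For the backup-plus-archived-logs path, the recovery follows the standard RMAN restore/recover sequence. The sketch below wraps that sequence in Python for illustration; the exact commands depend on the backup strategy and how far the archived logs allow you to roll forward, so treat it as an outline rather than a runbook.

```python
# Sketch: drive a generic RMAN restore/recover from Python. The command
# sequence is the standard "restore datafiles, roll forward with archived
# logs" outline; a real recovery depends on the backups and logs available.
import subprocess

RMAN_SCRIPT = """
startup mount;
restore database;
recover database;
alter database open;
"""

def run_rman(script: str) -> None:
    """Feed an RMAN command script to the local target database via stdin."""
    result = subprocess.run(
        ["rman", "target", "/"],
        input=script,
        text=True,
        capture_output=True,
    )
    print(result.stdout)
    if result.returncode != 0:
        raise RuntimeError(f"RMAN failed:\n{result.stderr}")

if __name__ == "__main__":
    run_rman(RMAN_SCRIPT)
```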
The lesson is to prioritize the technique that restores business continuity fastest, not the most technically sophisticated method. In this case, a solid backup strategy saved the day, even without a DR environment.
This article has been distilled and summarized from source material and republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
