Mastering Oracle RAC: Best Practices, Common Pitfalls, and Real-World Cases
This technical session covers Oracle RAC high‑availability best practices, installation steps, daily operational commands, detailed case studies of auto‑start checks, version‑mix issues, addNode failures, network heartbeat problems, and client connection errors, plus a concise Q&A on uninstall, SCAN vs VIP, and split‑brain detection.
Speaker Introduction
Ying Yifeng, supervisor of Hangzhou Meichuang Technology Service Department and Oracle 10g OCM, has over five years of professional Oracle database management experience, specializes in RAC high‑availability solutions, disaster recovery, and supports more than a hundred customer databases across various operating systems.
Presentation Overview
The live session is divided into three parts: (1) RAC installation best practices, (2) daily RAC operation basics, and (3) classic fault case studies.
Installation Best Practices
A minimal physical‑isolation topology is shown (see image). When load is light, a VLAN can be used to share public and heartbeat switches. A multi‑layer redundant architecture—heartbeat switch, public switch, optical cross‑connect, storage, and host—ensures short‑distance cross‑datacenter HA.
Key preparation steps include OS patching, verifying storage multipathing, checking network redundancy (dual-NIC bonding), and tuning system parameters. Correct ASM configuration and a monitoring setup (adjusting the AWR snapshot interval, deploying OSWatcher) matter both at installation time and for long-term maintenance.
Daily Operations and Commands
The RAC cluster consists of several core processes. The startup sequence is divided into three phases:
Init phase: /etc/inittab triggers init.ohasd, which starts ohasd.bin and its agents (orarootagent, oraagent, cssdagent, cssdmonitor).
Resource phase: processes such as mdnsd (multicast node discovery), gpnpd (bootstrap info distribution), gipcd (private interconnect management), and ocssd (heartbeat) are started.
Final phase: crsd launches cluster resources.
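The three phases can be written down as a small lookup table, which may help when reading ps output on a node that is stuck partway through startup. This is only a sketch: the phase labels and the furthest_phase helper are illustrative names rather than Oracle terminology, and the process names are the ones listed above.

```python
# Startup phases as described above; the labels are illustrative, not Oracle's.
STARTUP_PHASES = [
    ("init", ["init.ohasd", "ohasd", "orarootagent", "oraagent",
              "cssdagent", "cssdmonitor"]),
    ("resource", ["mdnsd", "gpnpd", "gipcd", "ocssd"]),
    ("final", ["crsd"]),
]


def _base_name(proc):
    """Strip a trailing '.bin' so 'ocssd.bin' and 'ocssd' compare equal."""
    return proc[:-4] if proc.endswith(".bin") else proc


def furthest_phase(running):
    """Return the last phase, in order, that has at least one process running.

    `running` is an iterable of process names (for example from `ps -eo comm`).
    Returns None when nothing from the stack is running.
    """
    names = {_base_name(p) for p in running}
    reached = None
    for phase, procs in STARTUP_PHASES:
        if names.intersection(procs):
            reached = phase
        else:
            break  # a gap means startup stopped before this phase
    return reached
```

If furthest_phase returns "resource" and never reaches "final", the ocssd and gipcd logs are a reasonable place to look first.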
Commonly used commands:
crs_stat – legacy 10g status command (still used for quick checks).
crsctl status resource -t – detailed 11g resource status.
ocrcheck – checks OCR disk information.
asmcmd – interactive ASM management.
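For routine checks, the read-only commands above can be collected by a small script. The following is a minimal sketch, assuming the Grid Infrastructure binaries are on PATH for the user running it; run_status and STATUS_COMMANDS are hypothetical names, and the output is returned unparsed because its layout differs between versions.

```python
import shutil
import subprocess

# Read-only status commands named in the list above.
STATUS_COMMANDS = [
    ["crsctl", "status", "resource", "-t"],
    ["ocrcheck"],
]


def run_status(cmd, timeout=60):
    """Run one status command and return its stdout as text.

    Returns None when the binary is not on PATH, so the caller can tell
    "not installed here" apart from "ran and printed nothing".
    """
    if shutil.which(cmd[0]) is None:
        return None
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    return result.stdout
```

A caller would loop over STATUS_COMMANDS and save each result with a timestamp, which gives a baseline to compare against after a fault.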
Important ASM views:
V$ASM_DISKGROUP – disk group name, status, size, usage.
V$ASM_DISK – individual disk details, mount status, failgroup.
V$ASM_OPERATION – progress of rebalance or add/delete operations.
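As an illustration of how the disk group view is typically used, the sketch below turns rows from V$ASM_DISKGROUP into a usage report. NAME, STATE, TOTAL_MB and FREE_MB are standard columns of that view; the usage_report helper, the 85% warning threshold, and the sample rows are assumptions made for the example. Fetching the rows with a database driver is left out.

```python
# Query text for the disk group view (standard V$ASM_DISKGROUP columns).
DISKGROUP_SQL = "SELECT name, state, total_mb, free_mb FROM v$asm_diskgroup"

# States treated as healthy: MOUNTED on the ASM instance, CONNECTED when the
# view is queried from a database instance that uses the disk group.
OK_STATES = {"MOUNTED", "CONNECTED"}


def usage_report(rows, warn_pct=85.0):
    """Convert (name, state, total_mb, free_mb) rows into report dicts.

    warn_pct is an arbitrary example threshold, not an Oracle recommendation.
    """
    report = []
    for name, state, total_mb, free_mb in rows:
        if total_mb == 0:
            used_pct = 0.0
        else:
            used_pct = round(100.0 * (total_mb - free_mb) / total_mb, 1)
        report.append({
            "name": name,
            "state": state,
            "used_pct": used_pct,
            "warn": state not in OK_STATES or used_pct >= warn_pct,
        })
    return report


# Made-up sample rows, shaped like the result of DISKGROUP_SQL.
sample_rows = [
    ("DATA", "MOUNTED", 102400, 10240),
    ("FRA", "MOUNTED", 51200, 40960),
]
```

With the sample rows, DATA is 90.0% used and flagged, while FRA is 20.0% used and not flagged.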
Case Studies
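Auto-start check: In 10g/11g RAC, the clusterware normally starts together with the OS. In 11gR2, crsctl config crs reports whether autostart is enabled, and crsctl disable crs turns it off for a maintenance window (crsctl enable crs restores it). After a reboot, crsctl status resource -t confirms which resources actually came up.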
11g RAC managing a 10g database: Copying 10g datafiles to an 11g ASM disk group raised ORA-29702, which points to a compatibility problem when newer clusterware manages an older RDBMS.
AddNode failures: Three typical problems are a pre-check failure (PRVF-5636), a Java out-of-memory error during the file copy, and unreadable files that block the transfer. Mitigations: run export IGNORE_PREADDNODE_CHECKS=Y before addNode, increase JRE_MEMORY_OPTIONS in oraparam.ini, and ignore the unreadable log files.
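The JRE_MEMORY_OPTIONS change is a one-line edit, sketched below as a text transformation. The sample line format (a quoted value containing -mx150m) is an assumption based on typical 11.2 installs, and raise_jre_memory is a hypothetical helper; check your own oraparam.ini and keep a backup before changing it.

```python
import re


def raise_jre_memory(ini_text, max_heap_mb):
    """Return ini_text with the -mx value on the JRE_MEMORY_OPTIONS line replaced.

    Raises ValueError if no such line is found, so a silent no-op cannot be
    mistaken for a successful change.
    """
    pattern = re.compile(r"^(JRE_MEMORY_OPTIONS\s*=.*?-mx)\d+(m.*)$", re.MULTILINE)
    new_text, count = pattern.subn(
        lambda m: f"{m.group(1)}{max_heap_mb}{m.group(2)}", ini_text
    )
    if count == 0:
        raise ValueError("no JRE_MEMORY_OPTIONS line with an -mx setting found")
    return new_text


# Assumed sample content; the real file has many more entries.
sample_ini = '[Oracle]\nJRE_MEMORY_OPTIONS=" -mx150m"\n'
updated = raise_jre_memory(sample_ini, 512)
```

Here updated contains JRE_MEMORY_OPTIONS=" -mx512m"; the value 512 is only an example, not a figure from the session.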
Network heartbeat missing: After an OS reinstall, the nodes had mismatched netmasks on the heartbeat network (255.255.255.0 vs 255.0.0.0). As a result, ocssd reported “disk HB but no network HB” and the cluster heartbeat broke.
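A mismatch like this can be caught before starting the clusterware by comparing the interface settings of each node. The sketch below uses Python's standard ipaddress module; the addresses are made up for illustration and same_subnet is a hypothetical helper.

```python
import ipaddress


def same_subnet(*interfaces):
    """Return True if every 'address/netmask' string maps to the same network.

    Each argument looks like '10.0.0.1/255.255.255.0' (prefix-length form
    such as '10.0.0.1/24' is accepted as well).
    """
    networks = {ipaddress.ip_interface(i).network for i in interfaces}
    return len(networks) == 1


# Made-up heartbeat addresses for two nodes.
ok = same_subnet("10.0.0.1/255.255.255.0", "10.0.0.2/255.255.255.0")
broken = same_subnet("10.0.0.1/255.255.255.0", "10.0.0.2/255.0.0.0")
```

In the second call the two nodes compute different networks (10.0.0.0/24 versus 10.0.0.0/8), which is the situation described in this case.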
Client connection failure after RAC migration: ORA-12543 “TNS:destination host unreachable” occurred only for clients on a different subnet. A SQL*Net trace showed the failure at the VIP/SCAN dispatch step (step 8). The issue was resolved by correcting the SCAN/VIP configuration.
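Because a client that connects through SCAN is then redirected to a listener on a node VIP, every one of those addresses has to be reachable from the client's subnet. A quick TCP check run from an affected client can narrow this down before reading traces. This is a minimal sketch: unreachable_endpoints is a hypothetical helper, the host names in the usage note are placeholders, and 1521 is only the default listener port.

```python
import socket


def unreachable_endpoints(endpoints, timeout=3.0):
    """Return the (host, port) pairs that do not accept a TCP connection.

    This tests basic network reachability only; it does not speak the
    Oracle Net protocol.
    """
    failed = []
    for host, port in endpoints:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                pass  # connection succeeded, close it immediately
        except OSError:
            failed.append((host, port))
    return failed
```

For example, calling unreachable_endpoints([("rac-scan.example.com", 1521), ("rac1-vip.example.com", 1521), ("rac2-vip.example.com", 1521)]) from a cross-subnet client would list any VIP that the client cannot route to.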
Q&A Highlights
RAC uninstall: use $ORACLE_HOME/deinstall/deinstall for a clean removal; manual file deletion is possible for 10g.
SCAN vs VIP: SCAN reduces client reconfiguration when adding/removing nodes, while some critical systems still prefer static VIP.
Split-brain detection: examine the ocssd logs for arbitration information.
Arbitration disk replacement: add/remove disks in the ASM disk group; Oracle automatically rebalances.
Heartbeat network: dual-NIC bonding or AIX EtherChannel is common; HAIP is mature in 11.2.0.4.
CPU imbalance: divergent execution plans can cause one node to consume high CPU; analyze with AWR reports.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
