
Mastering Oracle RAC: Best Practices, Common Pitfalls, and Real-World Cases

This technical session covers Oracle RAC high‑availability best practices, installation steps, daily operational commands, detailed case studies of auto‑start checks, version‑mix issues, addNode failures, network heartbeat problems, and client connection errors, plus a concise Q&A on uninstall, SCAN vs VIP, and split‑brain detection.

ITPUB

Speaker Introduction

Ying Yifeng, supervisor of the Service Department at Hangzhou Meichuang Technology and an Oracle 10g OCM, has more than five years of professional Oracle database administration experience. Ying specializes in RAC high‑availability solutions and disaster recovery, and supports more than a hundred customer databases across various operating systems.

Presentation Overview

The live session is divided into three parts: (1) RAC installation best practices, (2) daily RAC operation basics, and (3) classic fault case studies.

Installation Best Practices

A minimal physical‑isolation topology is shown (see image). When load is light, a VLAN can be used to share public and heartbeat switches. A multi‑layer redundant architecture—heartbeat switch, public switch, optical cross‑connect, storage, and host—ensures short‑distance cross‑datacenter HA.

Key preparation steps include OS patching, verifying storage multipathing, checking network redundancy (dual NIC bonding), and tuning system parameters. ASM configuration and monitoring tools (adjusting AWR interval, deploying OSWATCH) are essential for both one‑time installations and long‑term maintenance.

Daily Operations and Commands

The RAC cluster consists of several core processes. The startup sequence is divided into three phases:

Init phase: /etc/inittab triggers init.ohasd, which starts ohasd.bin and its agents (orarootagent, oraagent, cssdagent, cssdmonitor).

Resource phase: ohasd starts processes such as mdnsd (multicast DNS for node discovery), gpnpd (bootstrap info distribution), gipcd (private interconnect management), and ocssd (heartbeat).

Final phase: crsd launches cluster resources.
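The three phases above can be turned into a quick health check. The following is a minimal sketch, assuming the daemon names listed above appear as executable names in a `ps -eo args` style process listing. The function name `missing_daemons` and the phase grouping are hypothetical and not part of any Oracle tool.

```python
import os

# Daemon names taken from the three startup phases described above.
PHASES = {
    "init": ["ohasd"],
    "resource": ["mdnsd", "gpnpd", "gipcd", "ocssd"],
    "final": ["crsd"],
}


def missing_daemons(ps_output):
    """Return {phase: [daemons not found]} for a process listing (one command per line)."""
    running = set()
    for line in ps_output.splitlines():
        parts = line.split()
        if not parts:
            continue
        # Compare on the executable's base name, e.g. ".../bin/ocssd.bin" -> "ocssd".
        exe = os.path.basename(parts[0])
        if exe.endswith(".bin"):
            exe = exe[:-4]
        running.add(exe)

    report = {}
    for phase, names in PHASES.items():
        absent = [name for name in names if name not in running]
        if absent:
            report[phase] = absent
    return report
```

An empty result means every listed daemon was found. A result that names only crsd suggests the stack stalled before the final phase, which narrows down which log to read first.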

Commonly used commands:

crs_stat – legacy 10g status command (still used for quick checks).

crsctl status resource -t – detailed 11g resource status.

ocrcheck – checks OCR disk information.

asmcmd – interactive ASM management.
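For scripted checks, the non-tabular form `crsctl status resource` (without `-t`) is easier to parse than the table. The sketch below assumes that this output consists of blank-line-separated blocks of `NAME=`, `TYPE=`, `TARGET=`, and `STATE=` lines; the exact layout can vary by version, and the function names are hypothetical.

```python
def parse_resources(text):
    """Parse blank-line-separated KEY=value blocks into a list of dicts."""
    resources, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current resource block.
            if current:
                resources.append(current)
                current = {}
            continue
        key, sep, value = line.partition("=")
        if sep:
            current[key] = value.strip()
    if current:
        resources.append(current)
    return resources


def not_fully_online(resources):
    """Names of resources whose TARGET includes ONLINE but whose STATE includes OFFLINE."""
    return [
        r["NAME"]
        for r in resources
        if "ONLINE" in r.get("TARGET", "") and "OFFLINE" in r.get("STATE", "")
    ]
```

Feeding the captured command output to `parse_resources` and then to `not_fully_online` gives a short list of resources to investigate, which is often all a monitoring script needs.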

Important ASM views:

V$ASM_DISKGROUP – disk group name, status, size, usage.

V$ASM_DISK – individual disk details, mount status, failgroup.

V$ASM_OPERATION – progress of rebalance or add/delete operations.
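As an illustration of how the disk group view might be used for capacity monitoring, the sketch below pairs a query on V$ASM_DISKGROUP (columns NAME, STATE, TOTAL_MB, FREE_MB) with a small helper that flags nearly full groups. The helper name `nearly_full` and the 85% default threshold are assumptions chosen for the example, and the query text is left for whatever SQL client or driver is in use.

```python
# Query text only; run it with any available client or driver.
DISKGROUP_SQL = """
SELECT name, state, total_mb, free_mb
FROM   v$asm_diskgroup
"""


def nearly_full(rows, threshold_pct=85.0):
    """Return (name, used_pct) for disk groups at or above threshold_pct.

    rows: iterable of (name, state, total_mb, free_mb) tuples, as returned
    by DISKGROUP_SQL. Groups reporting zero total size are skipped.
    """
    flagged = []
    for name, state, total_mb, free_mb in rows:
        if not total_mb:
            continue
        used_pct = 100.0 * (total_mb - free_mb) / total_mb
        if used_pct >= threshold_pct:
            flagged.append((name, round(used_pct, 1)))
    return flagged
```

Raw free space is only a first approximation: on disk groups with normal or high redundancy, the USABLE_FILE_MB column of the same view is the more conservative figure to alert on.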

Case Studies

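Auto‑start check : In 10g/11g RAC, the cluster stack usually starts with the OS. Use crsctl status resource -t to confirm that resources came up after boot. On 11g, crsctl config crs reports whether auto‑start is enabled, and crsctl disable crs turns it off for maintenance (crsctl enable crs restores it).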

11g RAC managing a 10g database : Copying 10g datafiles to an 11g ASM disk group caused ORA‑29702 errors, indicating incompatibility when a newer RAC manages an older RDBMS.

AddNode failures : Three typical problems: the pre‑add self‑check fails (PRVF-5636), Java runs out of memory during the file copy, and non‑readable files prevent the transfer. Mitigations: run export IGNORE_PREADDNODE_CHECKS=Y before adding the node, increase JRE_MEMORY_OPTIONS in oraparam.ini, and ignore the non‑readable log files.

Network heartbeat missing : After an OS reinstall, a mismatched netmask (255.255.255.0 vs 255.0.0.0) caused ocssd to report “disk HB but no network HB”, which broke the cluster heartbeat.
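A mismatch like this is easy to catch before starting the cluster. The sketch below uses Python's standard `ipaddress` module to compute the network each node derives from its heartbeat address and netmask. It assumes the mismatch is between nodes; the node names, addresses, and function names are hypothetical.

```python
import ipaddress


def interconnect_networks(node_interfaces):
    """Map each node to the network implied by its (address, netmask) pair."""
    return {
        node: ipaddress.ip_interface(f"{addr}/{mask}").network
        for node, (addr, mask) in node_interfaces.items()
    }


def netmask_mismatch(node_interfaces):
    """True when the nodes do not all derive the same heartbeat network."""
    networks = set(interconnect_networks(node_interfaces).values())
    return len(networks) > 1


# Hypothetical two-node example reproducing the netmasks from the case above.
nodes = {
    "node1": ("10.10.10.1", "255.255.255.0"),
    "node2": ("10.10.10.2", "255.0.0.0"),
}
print(interconnect_networks(nodes))
print(netmask_mismatch(nodes))
```

With these inputs node1 derives 10.10.10.0/24 and node2 derives 10.0.0.0/8, so the check reports a mismatch even though both addresses look like they belong together.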

Client connection failure after RAC migration : ORA‑12543 “TNS:destination host unreachable” occurred only for cross‑subnet clients. A SQL*Net trace showed the failure at the VIP/SCAN dispatch step (step 8). The issue was resolved by correcting the SCAN/VIP configuration.
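Before reaching for a trace, a plain TCP probe from the affected client subnet can show which endpoints are unreachable. The sketch below assumes the client needs to reach the listener port on the SCAN address and on each node VIP; the host names in the usage note are placeholders, and the function names are hypothetical.

```python
import socket


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, unreachable hosts, and DNS failures.
        return False


def unreachable_endpoints(endpoints, timeout=3.0):
    """Return the (host, port) pairs from endpoints that cannot be reached."""
    return [(host, port) for host, port in endpoints if not can_reach(host, port, timeout)]
```

Run from a cross‑subnet client, a call such as `unreachable_endpoints([("rac-scan", 1521), ("node1-vip", 1521), ("node2-vip", 1521)])` would list the addresses that fail. If the SCAN name answers but a VIP does not, that is consistent with a failure at the dispatch step rather than at the initial connection.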

Q&A Highlights

RAC uninstall: use $ORACLE_HOME/deinstall/deinstall for a clean removal; manual file deletion is possible for 10g.

SCAN vs VIP: SCAN reduces client reconfiguration when adding/removing nodes, while some critical systems still prefer static VIP.

Split‑brain detection: examine the ocssd logs for arbitration information.

Arbitration (voting) disk replacement: add/remove disks in the ASM disk group; Oracle automatically rebalances.

Heartbeat network: dual‑NIC bonding or AIX EtherChannel is common; HAIP is mature in 11.2.0.4.

CPU imbalance: divergent execution plans can cause one node to consume high CPU; analyze with AWR reports.

