Root Cause Analysis of Intermittent Timeout Issues in the Sirius Service Caused by RAID Card Consistency Checks

This article details the investigation of sporadic interface timeouts in the Sirius real‑time pricing service. The timeouts followed a weekly pattern that traced back to RAID controller consistency checks: the checks caused IO spikes, the IO spikes blocked the logback queue, and the blocked queue led to Dubbo client‑side timeouts. The article also proposes mitigation steps and general performance‑troubleshooting guidelines.

Qunar Tech Salon

1. Introduction

Recently the real‑time pricing service (named Sirius) experienced occasional interface timeouts. This article outlines the occurrence, the investigation process, the conclusions, and generic performance‑troubleshooting methods.

2. Sirius Service Overview

Sirius is a core hotel‑booking service providing four main APIs: List (L), Detail (D), Booking (B), and Order (O).

3. Occurrence Timeline

2024‑02‑10 (Lunar New Year): D‑page timeout rate increased; initial hypothesis pointed to GC.

2024‑02‑17: Similar timeout spike after batch restart; later observations disproved the GC hypothesis.

2024‑03‑01: Review showed another spike on 2024‑02‑24, confirming a weekly pattern.

4. Investigation Process

(1) Host‑Related Findings

Timeouts only occurred on specific pods sharing the same host, indicating a host‑level issue.
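One way to check whether timeouts cluster on a host is to group timed‑out requests by the node each pod runs on. The sketch below assumes a hypothetical list of (pod, host) pairs, one per timed‑out request; the pod and host names are invented for illustration and do not come from the Sirius deployment.

```python
from collections import Counter

# Hypothetical sample: one (pod, host) pair per timed-out request.
timeouts = [
    ("sirius-a", "host-1"),
    ("sirius-b", "host-1"),
    ("sirius-a", "host-1"),
    ("sirius-c", "host-2"),
    ("sirius-b", "host-1"),
]

def timeouts_per_host(records):
    """Count timed-out requests per host, most affected host first."""
    return Counter(host for _pod, host in records).most_common()

ranking = timeouts_per_host(timeouts)
top_host, top_count = ranking[0]
share = top_count / len(timeouts)
print(f"{top_host}: {top_count} timeouts ({share:.0%} of total)")
```

If one host carries most of the timeouts while pods of the same service on other hosts stay healthy, the cause is more likely below the application layer.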

(2) Periodic Pattern

All spikes occurred around 11:00 AM on Saturdays. The earliest was on 2023‑10‑28, and their duration and severity grew over time.
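A periodic pattern like this can be confirmed by bucketing spike timestamps by weekday and hour. The following is a minimal sketch: the dates are the ones mentioned in this article, while the exact 11:00 time of day is an approximation and the function name is illustrative.

```python
from collections import Counter
from datetime import datetime

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

# Dates are taken from the article; 11:00 is an approximate time of day.
spikes = [
    datetime(2023, 10, 28, 11, 0),
    datetime(2024, 2, 10, 11, 0),
    datetime(2024, 2, 17, 11, 0),
    datetime(2024, 2, 24, 11, 0),
]

def weekly_buckets(timestamps):
    """Count spikes per (weekday name, hour) bucket."""
    return Counter((DAYS[ts.weekday()], ts.hour) for ts in timestamps)

buckets = weekly_buckets(spikes)
print(buckets.most_common(1))
```

When every spike falls into a single bucket, a scheduled job is a stronger suspect than load or GC behavior.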

(3) Abnormal IO Usage

During the problematic windows, CPU iowait, disk IO utilization, and the number of blocked processes all spiked sharply, and the spikes tracked the D‑page timeout trend closely.
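On Linux, two of these signals can be read from `/proc/stat`: the aggregate `cpu` line (whose fifth counter is iowait) and the `procs_blocked` line. The sketch below computes the iowait share between two samples. The sample text is made up for illustration, and the article does not say which monitoring agent produced the original metrics.

```python
def parse_proc_stat(text):
    """Return (cpu counters, procs_blocked) from /proc/stat content."""
    cpu, blocked = None, None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "cpu":
            cpu = [int(v) for v in fields[1:]]
        elif fields[0] == "procs_blocked":
            blocked = int(fields[1])
    return cpu, blocked

def iowait_percent(before, after):
    """Share of CPU time spent in iowait between two samples."""
    # Use the first 8 counters (user .. steal); guest time is already
    # included in user time, so adding it would double count.
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    return 100.0 * deltas[4] / total if total else 0.0

# Made-up samples in /proc/stat format, taken some seconds apart.
SAMPLE_BEFORE = "cpu  1000 0 500 8000 100 0 0 0 0 0\nprocs_blocked 0\n"
SAMPLE_AFTER = "cpu  1100 0 550 8250 700 0 0 0 0 0\nprocs_blocked 6\n"

cpu_before, _ = parse_proc_stat(SAMPLE_BEFORE)
cpu_after, blocked_now = parse_proc_stat(SAMPLE_AFTER)
iowait = iowait_percent(cpu_before, cpu_after)
print(f"iowait={iowait:.1f}% procs_blocked={blocked_now}")
```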

(4) Source of Abnormal IO

Inspecting the RAID controllers showed that hosts with an LSI MegaRAID card and SSD RAID‑1 arrays run a scheduled consistency check (CC) every Saturday at 03:00 UTC, which competes with application IO.
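The 03:00 UTC schedule and the 11:00 AM spikes describe the same moment if the spike times are reported in UTC+8. That offset is an assumption here (Beijing time), not something the article states. A short check of the conversion:

```python
from datetime import datetime, timedelta, timezone

# Assumption: dashboards report Beijing time (UTC+8).
UTC8 = timezone(timedelta(hours=8))

def to_local(utc_dt, tz=UTC8):
    """Convert an aware UTC datetime to the assumed dashboard timezone."""
    return utc_dt.astimezone(tz)

# One of the Saturdays named in the timeline, at the CC start time.
cc_start_utc = datetime(2024, 2, 24, 3, 0, tzinfo=timezone.utc)
cc_start_local = to_local(cc_start_utc)
print(cc_start_local.strftime("%Y-%m-%d %H:%M"))
```

Under that assumption the check starts at 11:00 on the same Saturday, which matches the observed spike window.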

(5) Impact on Application

Under heavy IO, the thread that drains logback’s asynchronous queue cannot write to disk fast enough, so the queue fills up. Once the queue is full, logging calls block the request threads. Responses then exceed the Dubbo client’s timeout, which shows up as the observed D‑page interface timeouts.
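The blocking behavior can be illustrated with a small model of a bounded queue whose consumer has stalled. This is a Python analogy, not Logback code: the function names are invented, the queue size of 4 is arbitrary, and the 0.2 second cap exists only so the demo terminates. Logback's `AsyncAppender` exposes related settings (`queueSize`, `discardingThreshold`, `neverBlock`); whether Sirius changed any of them is not stated in the article.

```python
import queue
import time

def log_blocking(q, event, timeout_s):
    """Wait-for-space policy; returns how long the caller was stalled."""
    start = time.monotonic()
    try:
        q.put(event, timeout=timeout_s)
    except queue.Full:
        pass  # a real blocking appender would keep waiting; capped for the demo
    return time.monotonic() - start

def log_never_block(q, event):
    """Drop-on-full policy; returns True if the event was stored."""
    try:
        q.put_nowait(event)
        return True
    except queue.Full:
        return False

# The disk writer is stalled, so nothing drains the queue.
log_queue = queue.Queue(maxsize=4)
for i in range(4):
    log_queue.put_nowait(f"event-{i}")

stalled_s = log_blocking(log_queue, "request-log", timeout_s=0.2)
stored = log_never_block(log_queue, "request-log")
print(f"blocking caller stalled for {stalled_s:.2f}s; drop policy stored={stored}")
```

The trade-off is visible in the two return values: the blocking policy keeps every log line but passes disk latency on to the request thread, while the drop policy protects latency at the cost of losing log events.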

5. Conclusions

(1) Trigger Conditions

Hardware: A specific RAID card model with scheduled consistency check (CC) and patrol read (PR) tasks.

Software: A service, such as Sirius, that combines heavy disk IO with sensitivity to latency.

(2) Impact Scope

Currently only Sirius is affected, but any service with similar characteristics could suffer the same issue.

(3) Loss Assessment

Although the issue persisted for months, its impact remained below fault thresholds; however, the trend was worsening.

(4) Solutions

Operations: Disable RAID consistency checks for the affected model across the fleet.

Business: Reduce unnecessary logging to avoid queue blockage.

6. General Performance‑Problem Troubleshooting Methodology

Key steps include finding patterns, using control groups for comparison, reproducing the issue, making bold hypotheses and carefully validating them, and focusing on the four main resource domains: CPU, memory, disk, and network.
