Root Cause Analysis of Intermittent Timeout Issues in the Sirius Service Caused by RAID Card Consistency Checks
This article details the investigation of sporadic interface timeouts in the Sirius real‑time pricing service. The timeouts followed a weekly pattern that was traced to RAID controller consistency checks: the checks caused IO spikes, the IO spikes blocked the logback queue, and the blocked request threads led to Dubbo client‑side timeouts. The article also proposes mitigation steps and general performance‑troubleshooting guidelines.
1. Introduction
Recently the real‑time pricing service (named Sirius) experienced occasional interface timeouts. This article outlines the occurrences, the investigation process, the conclusions, and generic performance‑troubleshooting methods.
2. Sirius Service Overview
Sirius is a core hotel‑booking service providing four main APIs: List (L), Detail (D), Booking (B), and Order (O).
3. Occurrence Timeline
2024‑02‑10 (Lunar New Year): D‑page timeout rate increased; initial hypothesis pointed to GC.
2024‑02‑17: Similar timeout spike after batch restart; later observations disproved the GC hypothesis.
2024‑03‑01: Review showed that another spike had occurred on 02‑24, confirming a weekly pattern.
4. Investigation Process
(1) Host‑Related Findings
Timeouts only occurred on specific pods sharing the same host, indicating a host‑level issue.
(2) Periodic Pattern
All spikes occurred around 11:00 AM on Saturdays, the earliest on 2023‑10‑28, and they grew longer and more severe over time.
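One way to surface a periodic pattern like this is to bucket timeout events by weekday and hour and look for a dominant bucket. The following is a minimal sketch of that idea, not the tooling used in the investigation; the class name `TimeoutPatternFinder`, its methods, and the sample timestamps are all hypothetical.

```java
import java.time.LocalDateTime;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: counts timeout events per (day-of-week, hour) bucket.
class TimeoutPatternFinder {

    // Returns bucket label (e.g. "SATURDAY 11") -> number of timeouts in that bucket.
    static Map<String, Integer> bucketByWeekdayAndHour(List<LocalDateTime> timeouts) {
        Map<String, Integer> counts = new TreeMap<>();
        for (LocalDateTime t : timeouts) {
            String key = String.format("%s %02d", t.getDayOfWeek(), t.getHour());
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    // Returns the bucket with the most timeouts, or null if the input is empty.
    static String busiestBucket(List<LocalDateTime> timeouts) {
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : bucketByWeekdayAndHour(timeouts).entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up sample timestamps, for illustration only.
        List<LocalDateTime> sample = List.of(
            LocalDateTime.of(2024, 2, 10, 11, 5),
            LocalDateTime.of(2024, 2, 17, 11, 12),
            LocalDateTime.of(2024, 2, 24, 11, 3),
            LocalDateTime.of(2024, 2, 20, 16, 40));
        System.out.println(bucketByWeekdayAndHour(sample));
        System.out.println("Busiest bucket: " + busiestBucket(sample));
    }
}
```

A bucket that dominates across many weeks is a hint that a scheduled job, rather than traffic or GC, is involved.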
(3) Abnormal IO Usage
During the problematic windows, CPU wait, disk IO usage, and process blocked metrics all showed sharp spikes, strongly correlated with D‑page timeout trends.
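The strength of such a correlation can be quantified with a Pearson coefficient over two equally sampled series, for example disk IO utilization and timeout rate per minute. The sketch below is illustrative only: the class name `MetricCorrelation` and the sample values are made up, and the article does not say which statistic, if any, was actually computed.

```java
// Hypothetical helper for comparing two metric series sampled at the same instants.
class MetricCorrelation {

    // Pearson correlation coefficient of x and y; NaN if either series is constant.
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("series must have equal length >= 2");
        }
        int n = x.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double cov = 0, varX = 0, varY = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        if (varX == 0 || varY == 0) {
            return Double.NaN; // correlation is undefined for a constant series
        }
        return cov / Math.sqrt(varX * varY);
    }

    public static void main(String[] args) {
        // Made-up per-minute samples, for illustration only.
        double[] ioUtilPercent = {12, 15, 80, 95, 90, 20};
        double[] timeoutRatePercent = {0.1, 0.1, 1.8, 2.5, 2.2, 0.2};
        System.out.printf("pearson = %.3f%n", pearson(ioUtilPercent, timeoutRatePercent));
    }
}
```

A value close to 1 supports the link between the two metrics, but correlation alone does not show which one drives the other; that still needs the kind of causal check described in the next steps.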
(4) Source of Abnormal IO
Investigation of the RAID cards revealed that hosts with the LSI MegaRAID model and SSD RAID‑1 arrays run a scheduled consistency check (CC) every Saturday at 03:00 UTC. That is 11:00 AM at UTC+8, which matches the observed spike window, and the check competes with the application for disk IO.
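The controller schedule is given in UTC while the spikes were observed in local time, so the two need to be put on the same clock before comparing them. A minimal sketch of that conversion follows; the `Asia/Shanghai` zone is an assumption about where the spike times were recorded, and the class name is hypothetical.

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

// Hypothetical helper: converts a UTC schedule time to a local zone.
class CcScheduleLocalTime {

    // Interprets utcTime as UTC and returns the same instant in the given zone.
    static ZonedDateTime toLocal(LocalDateTime utcTime, ZoneId zone) {
        return utcTime.atZone(ZoneOffset.UTC).withZoneSameInstant(zone);
    }

    public static void main(String[] args) {
        // 2024-02-10 is one of the Saturdays from the timeline; 03:00 UTC is the CC start time.
        LocalDateTime ccStartUtc = LocalDateTime.of(2024, 2, 10, 3, 0);
        // Assumption: monitoring dashboards use China Standard Time.
        ZonedDateTime local = toLocal(ccStartUtc, ZoneId.of("Asia/Shanghai"));
        System.out.println(local.getDayOfWeek() + " " + local.toLocalTime());
    }
}
```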
(5) Impact on Application
High disk IO slows log writes, so logback’s asynchronous appender queue fills up; once the queue is full, logback blocks the request threads that are trying to log. The stalled requests exceed the Dubbo client’s timeout, which is what surfaced as the D‑page interface timeouts.
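The blocking behavior can be modeled with a plain bounded queue. The sketch below is not logback's code: it uses `java.util.concurrent.ArrayBlockingQueue` to stand in for the appender queue, leaves out the consumer to imitate a writer stalled on disk, and uses a hypothetical class name. It contrasts a producer that waits for space with one that drops the event instead.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Illustrative model only; this is not logback's implementation.
class BoundedLogQueueDemo {

    // Blocking-style enqueue: the calling (request) thread waits up to maxWaitMillis for space.
    // Returns true if the event was queued, false if the wait timed out or was interrupted.
    static boolean enqueueBlocking(BlockingQueue<String> queue, String event, long maxWaitMillis) {
        try {
            return queue.offer(event, maxWaitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Non-blocking enqueue: returns immediately; false means the event was dropped.
    static boolean enqueueOrDrop(BlockingQueue<String> queue, String event) {
        return queue.offer(event);
    }

    public static void main(String[] args) {
        // Small queue with no consumer, imitating a log writer stalled on slow disk IO.
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(2);
        queue.add("event-1");
        queue.add("event-2");

        long start = System.nanoTime();
        boolean queued = enqueueBlocking(queue, "event-3", 200);
        long stalledMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        System.out.println("blocking enqueue: queued=" + queued
            + ", request thread stalled ~" + stalledMs + " ms");

        System.out.println("non-blocking enqueue: queued=" + enqueueOrDrop(queue, "event-4"));
    }
}
```

Logback's `AsyncAppender` documents a `neverBlock` option that trades this stall for dropped events, along with `queueSize` and `discardingThreshold` settings. Whether dropping log events is acceptable for a given service is a judgment call and should be checked against the logback documentation before changing any configuration.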
5. Conclusions
(1) Trigger Conditions
Hardware: A specific RAID card model with scheduled consistency‑check (CC) or patrol‑read (PR) tasks.
Software: A service that, like Sirius, combines heavy disk IO with strict latency requirements.
(2) Impact Scope
Currently only Sirius is affected, but any service with similar characteristics could suffer the same issue.
(3) Loss Assessment
Although the issue persisted for months, its impact remained below fault thresholds; however, the trend was worsening.
(4) Solutions
Operations: Disable RAID consistency checks for the affected model across the fleet.
Business: Reduce unnecessary logging to avoid queue blockage.
6. General Performance‑Problem Troubleshooting Methodology
Key steps include finding patterns, using control groups for comparison, reproducing the issue, making bold hypotheses and carefully validating them, and focusing on the four main resource domains: CPU, memory, disk, and network.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.