Investigation of Intermittent Redis Timeout Issues Caused by a Kernel Scheduling Bug on Skylake Servers
The article details how Ctrip engineers diagnosed sporadic Redis timeouts in containerized deployments, traced the problem to kernel scheduling delays caused by an APIC‑ID bug that inflated the possible‑CPU count, and resolved it by applying a kernel patch, offering verification steps for affected systems.
Ctrip's large‑scale container deployment began showing frequent, low‑QPS Redis timeouts, prompting a deep dive into the root cause.
Initial packet captures across the app container, Redis container, and host revealed delayed response paths (B→A→C→D and back) and pointed to a problematic link between host A and its switch E, though network equipment was later ruled out.
Further investigation uncovered abnormal ping latencies and inconsistent behavior across servers with different kernel versions (3.10, 4.10, 4.14), indicating the issue was not tied to a specific kernel release.
By examining a time‑drift logging program running on the hosts, engineers observed growing clock offsets; TSC measurements using rdtscp showed large jumps, and perf sched record -a sleep 60 alongside perf sched latency -s max recorded scheduler delays exceeding one second.
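The article does not show the time-drift logging program itself, so the following is only a minimal Python sketch of the general idea: sleep for a short fixed interval in a loop and record any wake-up that arrives much later than requested. The function name detect_stalls, its parameters, and the 50 ms threshold are all illustrative assumptions, not the engineers' actual tool.

```python
import time


def detect_stalls(interval_s=0.001, threshold_s=0.05, duration_s=1.0,
                  clock=time.monotonic, sleep=time.sleep):
    """Return (timestamp, overshoot_seconds) for every late wake-up.

    All names and default values here are hypothetical. The clock and
    sleep callables are injectable so the loop can be tested without
    real waiting.
    """
    stalls = []
    start = clock()
    prev = start
    while prev - start < duration_s:
        sleep(interval_s)
        now = clock()
        # Time spent beyond the requested sleep: a rough proxy for how
        # long this process waited to be scheduled back onto a CPU.
        overshoot = (now - prev) - interval_s
        if overshoot > threshold_s:
            stalls.append((now, overshoot))
        prev = now
    return stalls
```

Calling detect_stalls(duration_s=60.0) on a suspect host and printing the returned pairs gives a coarse view of scheduling stalls. It cannot attribute a stall to a cause, which is why the perf sched tooling mentioned above is still needed.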
Tracing the symptom led to a kernel commit between 4.14.36 and 4.14.37 that fixed handling of an invalid APIC ID (0xffffffff), which previously caused the kernel to report an inflated number of possible CPUs. The excessive possible‑CPU count caused loops such as for_each_possible_cpu() to run tens of times longer, severely degrading scheduler performance.
Applying the patch restored the correct possible‑CPU count, eliminated the scheduler stalls, and returned Redis latency to normal; performance differences were confirmed by comparing perf statistics from a patched host (uptime 89 days) with an unpatched one (uptime 2 days).
In conclusion, kernels from 4.10 up to 4.14.36 on Skylake‑class CPUs are vulnerable to this bug, and the fix first appears in 4.14.37; comparing the possible‑CPU range in /sys/devices/system/cpu/possible with the number of CPUs actually present provides a quick way to identify affected hosts.
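A quick host check along these lines could be sketched as follows in Python. It assumes the standard Linux sysfs files /sys/devices/system/cpu/possible and /sys/devices/system/cpu/present, which hold CPU lists such as "0-39". The helper names count_cpus and check_host are hypothetical, and the comparison is a heuristic rather than the article's exact procedure.

```python
from pathlib import Path

# Standard sysfs location for CPU topology files on Linux.
CPU_SYSFS = Path("/sys/devices/system/cpu")


def count_cpus(cpu_list):
    """Count CPUs in a kernel cpulist string such as '0-3,8-11'."""
    total = 0
    for part in cpu_list.strip().split(","):
        if not part:
            continue  # an empty string means no CPUs listed
        lo, _, hi = part.partition("-")
        # A single entry like '5' has no upper bound, so reuse lo.
        total += int(hi or lo) - int(lo) + 1
    return total


def check_host(sysfs=CPU_SYSFS):
    """Return (possible, present) CPU counts read from sysfs."""
    possible = count_cpus((sysfs / "possible").read_text())
    present = count_cpus((sysfs / "present").read_text())
    return possible, present
```

If check_host() reports a possible count far larger than the present count, the host is worth a closer look. A small gap can be legitimate on machines that support CPU hotplug, so the result should be read as a hint, not proof of the bug.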
Ctrip Technology
Official Ctrip Technology account, sharing and discussing growth.
