Why Did Redis Slowlog Show 1800 ms Delays? Uncovering a Kernel Time‑Source Bug
During a Redis container migration, slowlog entries suddenly reported latencies of about 1800 ms despite low QPS. The investigation traced the anomaly to a kernel bug affecting the Time Stamp Counter (TSC) on Skylake‑X CPUs, combined with the use of gettimeofday() to measure elapsed time.
Problem Description
After migrating Redis instances to containers, DBA alerts showed slowlog entries exceeding 500 ms, with some entries around 1800 ms. The first batch of hosts did not exhibit this behavior, and the higher QPS of the second batch could not explain the delay.
Analysis
What is Redis slowlog?
Redis records any command whose execution time exceeds the slowlog‑log‑slower‑than threshold. The recorded duration covers only the time spent executing the command inside Redis, not network latency, and it is computed from wall‑clock readings obtained with gettimeofday().
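The measurement pattern can be sketched as follows. This is a simplified Python illustration, not Redis's actual C code: run_with_slowlog, SLOWLOG_THRESHOLD_US, and the in-memory slowlog list are hypothetical names, and time.time() stands in for the wall-clock read. Because the duration is the difference between two wall-clock readings, any step in the wall clock between them is counted as execution time.

```python
import time

# Hypothetical threshold in microseconds, the same unit that
# slowlog-log-slower-than uses (the value here is only illustrative).
SLOWLOG_THRESHOLD_US = 10_000

slowlog = []  # hypothetical in-memory log of (command name, duration in µs)


def run_with_slowlog(name, command, clock=time.time):
    """Run command() and log it if its wall-clock duration exceeds the threshold."""
    start = clock()
    result = command()
    duration_us = round((clock() - start) * 1_000_000)
    if duration_us > SLOWLOG_THRESHOLD_US:
        slowlog.append((name, duration_us))
    return result
```

Passing a fake clock whose second reading is 1.8 s ahead of the first logs an entry of about 1,800,000 µs even if the command itself returns immediately.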
Conflicting measurements
Slowlog consistently reported ~1800 ms, while the CAT tracing system (based on https://github.com/dianping/cat) showed a maximum of only 367 ms for the same operations, so the two measurements could not both be right.
Timer verification
A simple loop calling gettimeofday() once per second was used to verify the timer. Over ~20 minutes the test showed a deviation of roughly 1813 ms, confirming that gettimeofday() was inaccurate on the affected hosts.
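The original test program is not shown. A similar check might look like the sketch below, which differs from the described test in one respect: it compares the wall clock against the monotonic clock rather than against the expected one-second interval. The function name find_clock_steps and its parameters are hypothetical, and the 0.5 s threshold is an arbitrary choice.

```python
import time


def find_clock_steps(samples, interval=1.0, threshold=0.5,
                     wall=time.time, mono=time.monotonic, sleep=time.sleep):
    """Return wall-clock jumps (in seconds) larger than threshold.

    Each sample compares how far the wall clock advanced against how far the
    monotonic clock advanced over the same interval.
    """
    steps = []
    prev_wall, prev_mono = wall(), mono()
    for _ in range(samples):
        sleep(interval)
        now_wall, now_mono = wall(), mono()
        skew = (now_wall - prev_wall) - (now_mono - prev_mono)
        if abs(skew) > threshold:
            steps.append(skew)
        prev_wall, prev_mono = now_wall, now_mono
    return steps
```

Calling find_clock_steps(1200) samples once per second for about 20 minutes and returns any intervals in which the wall clock moved noticeably more or less than the monotonic clock.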
System clock investigation
Running
cat /sys/devices/system/clocksource/clocksource0/current_clocksource
showed that all hosts used the Time Stamp Counter (TSC) as the clock source. On Skylake‑X CPUs, a kernel bug present in versions 4.9–4.13 miscalculates the crystal frequency, causing the clock to lose about 1 s every 10 minutes.
The second batch of hosts were Xeon Bronze 3104 (Skylake‑X) running kernel 4.10, while the first batch used older hardware and kernels, explaining the differing behavior.
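When many hosts need to be checked, the same sysfs lookup can be scripted. The sketch below is an illustration only: read_clocksource is a hypothetical helper, and it assumes a Linux host that exposes the standard current_clocksource and available_clocksource files.

```python
from pathlib import Path

CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksource(base=CLOCKSOURCE_DIR):
    """Return (current, available) clock sources, or (None, []) if unreadable."""
    try:
        current = (base / "current_clocksource").read_text().strip()
        available = (base / "available_clocksource").read_text().split()
    except OSError:
        # Not Linux, or sysfs is not mounted at the usual location.
        return None, []
    return current, available
```

On the affected hosts described here, the first element of the result would be tsc.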
Root Cause
The kernel commit that added the INTEL_FAM6_SKYLAKE_X macro set an incorrect crystal frequency for Skylake‑X CPUs, leading to a systematic clock slowdown. NTP periodically corrected the accumulated lag, and because Redis computes slowlog durations from gettimeofday(), a command that was executing when the clock was corrected appeared to take about 1800 ms.
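A back-of-the-envelope check relates the stated drift rate to the observed entries. The sketch below assumes the drift is constant at about 1 s per 10 minutes (the figure given above) and that a correction is applied as a single step. Both function names are hypothetical.

```python
# About 1 s lost every 10 minutes, as stated in the text.
DRIFT_S_PER_S = 1.0 / 600.0


def accumulated_lag_ms(seconds_since_sync, drift=DRIFT_S_PER_S):
    """Lag (ms) built up by a clock running slow for seconds_since_sync."""
    return seconds_since_sync * drift * 1000.0


def seconds_to_accumulate(lag_ms, drift=DRIFT_S_PER_S):
    """Time (s) a slow clock needs to fall behind by lag_ms."""
    return (lag_ms / 1000.0) / drift
```

Under these assumptions, seconds_to_accumulate(1800) is about 1080 s, or 18 minutes, which is in line with the roughly 20-minute window in which the timer test saw a deviation of about 1.8 s.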
Recommendations
Use proper timing APIs
For wall‑clock timestamps, prefer clock_gettime(CLOCK_REALTIME) over gettimeofday(), which POSIX.1‑2008 marks as obsolescent.
For measuring elapsed time, use clock_gettime(CLOCK_MONOTONIC) (or System.nanoTime() in Java) to avoid NTP‑induced jumps.
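The elapsed-time recommendation can be illustrated in Python, where time.monotonic_ns() plays the role of clock_gettime(CLOCK_MONOTONIC). This is a minimal sketch: the elapsed_ms helper and the dictionary used to collect results are hypothetical.

```python
import time
from contextlib import contextmanager


@contextmanager
def elapsed_ms(label, sink):
    """Store the monotonic elapsed time of the with-block in sink[label] (ms)."""
    start_ns = time.monotonic_ns()
    try:
        yield
    finally:
        # The monotonic clock is not stepped by wall-clock corrections, so a
        # clock adjustment during the block does not inflate the measurement.
        sink[label] = (time.monotonic_ns() - start_ns) / 1_000_000


timings = {}
with elapsed_ms("example", timings):
    time.sleep(0.01)
```

After the block runs, timings["example"] holds the elapsed milliseconds. Wall-clock reads remain appropriate for timestamps that are shown to users or written to logs.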
Upgrading the host kernel to version 4.14 or later eliminates the TSC bug, restoring accurate timing for Redis and other latency‑sensitive services.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
