Why Did Redis Slowlog Show 1800 ms Delays? Uncovering a Kernel Time‑Source Bug

During a Redis container migration, slowlog entries suddenly reported 1800 ms latency despite low QPS. The investigation traced the anomaly to two things: a kernel bug affecting the Time Stamp Counter on Skylake‑X CPUs, and the use of gettimeofday() to measure elapsed time.

ITPUB

Problem Description

After migrating Redis instances to containers, DBA alerts showed slowlog entries exceeding 500 ms, with some entries around 1800 ms. The first batch of hosts did not exhibit this behavior, and the higher QPS of the second batch could not explain the delay.

Analysis

What is Redis slowlog?

Redis records any command whose execution time exceeds the slowlog‑log‑slower‑than threshold. The recorded duration is the difference between two gettimeofday() readings taken around command execution, so it covers only time spent inside Redis and excludes network latency.
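
The measurement can be sketched as follows. This is a minimal illustration in Python, not Redis's actual C code: `run_with_slowlog`, `SLOWER_THAN_US`, `wall_clock_us`, and the injectable `clock` parameter are hypothetical names introduced here. It shows why a step in the wall clock during a command is indistinguishable from a slow command.

```python
import time

# Hypothetical threshold in microseconds, standing in for
# slowlog-log-slower-than (10000 us is only an example value here).
SLOWER_THAN_US = 10_000


def wall_clock_us():
    """Wall-clock time in microseconds, analogous to a gettimeofday() reading."""
    return time.time_ns() // 1_000


def run_with_slowlog(command, slowlog, clock=wall_clock_us):
    """Run command() and append (name, duration_us) to slowlog if it was slow."""
    start = clock()
    result = command()
    duration_us = clock() - start
    if duration_us > SLOWER_THAN_US:
        slowlog.append((command.__name__, duration_us))
    return result


def fake_get():
    """Stand-in for a command that takes essentially no time."""
    return "value"


# Simulated clock that is stepped forward by 1.8 s between the two reads,
# as a clock correction might do. The command itself does no real work.
readings = iter([1_000_000, 1_000_000 + 1_800_000])
log = []
run_with_slowlog(fake_get, log, clock=lambda: next(readings))
print(log)  # [('fake_get', 1800000)]
```

The logged 1,800,000 µs comes entirely from the clock step, not from the command.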

Conflicting measurements

Slowlog consistently reported ~1800 ms, while the CAT tracing system (based on https://github.com/dianping/cat) showed a maximum of only 367 ms for the same operations, indicating a discrepancy.

Timer verification

A simple loop calling gettimeofday() once per second was used to verify the timer. Over ~20 minutes the test showed a deviation of roughly 1813 ms, the same magnitude as the slowlog entries, confirming that gettimeofday() was inaccurate on the affected hosts.
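
A check of this kind might look like the sketch below. It is a variant of the test described above, not the original script: instead of only sampling the wall clock, it compares wall-clock deltas against monotonic-clock deltas for each interval. `measure_drift` and its parameters are hypothetical names. One caveat: the monotonic clock is derived from the same hardware counter, so this comparison exposes step corrections to the wall clock rather than the underlying slow rate.

```python
import time


def measure_drift(samples, interval_s=1.0,
                  wall=time.time, mono=time.monotonic, sleep=time.sleep):
    """Return, per interval, how far the wall clock moved relative to the
    monotonic clock, in milliseconds (0 means the two clocks agreed)."""
    drifts_ms = []
    prev_wall, prev_mono = wall(), mono()
    for _ in range(samples):
        sleep(interval_s)
        now_wall, now_mono = wall(), mono()
        drift_s = (now_wall - prev_wall) - (now_mono - prev_mono)
        drifts_ms.append(round(drift_s * 1000.0, 3))
        prev_wall, prev_mono = now_wall, now_mono
    return drifts_ms


# Example use on a real host (sleeps for about five seconds):
#     print(measure_drift(samples=5))
```

On a healthy host the values stay near zero. An interval in which the wall clock was stepped shows up as a single large value.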

System clock investigation

Running

cat /sys/devices/system/clocksource/clocksource0/current_clocksource

showed that all hosts used the Time Stamp Counter (TSC) as the clock source. On Skylake‑X CPUs, a kernel bug present in versions 4.9 through 4.13 miscalculates the crystal frequency, causing the clock to lose about 1 s every 10 minutes.
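
The same check can be scripted when many hosts need to be surveyed. The sketch below reads the sysfs files directly; `read_clocksource` and `CLOCKSOURCE_DIR` are hypothetical names, and the function returns `(None, [])` on systems where the files are missing or unreadable.

```python
from pathlib import Path

CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksource(base=CLOCKSOURCE_DIR):
    """Return (current, available) clock sources, or (None, []) if unreadable."""
    base = Path(base)
    try:
        current = (base / "current_clocksource").read_text().strip()
        available = (base / "available_clocksource").read_text().split()
    except OSError:
        return None, []
    return current, available


current, available = read_clocksource()
print("current:", current, "available:", available)
```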

The second batch of hosts used Xeon Bronze 3104 CPUs (Skylake‑X) and ran kernel 4.10, while the first batch used older hardware and kernels, which explains the differing behavior.
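
A rough way to flag hosts by kernel version is sketched below, using the 4.9 to 4.13 range stated in this article. `parse_kernel_version` and `in_affected_range` are hypothetical helpers. The version number is only a hint: distribution kernels may carry backported fixes, and the CPU model has to match as well.

```python
import re

# Affected range as stated in this article: 4.9 up to, but not including, 4.14.
FIRST_AFFECTED = (4, 9)
FIRST_FIXED = (4, 14)


def parse_kernel_version(release):
    """Extract (major, minor) from a release string such as '4.10.0-42-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def in_affected_range(release):
    """True if the (major, minor) version falls inside the affected range."""
    return FIRST_AFFECTED <= parse_kernel_version(release) < FIRST_FIXED


# The release string would normally come from `uname -r` on the host.
print(in_affected_range("4.10.0-42-generic"))  # True
print(in_affected_range("4.14.0"))             # False
```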

Root Cause

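The kernel commit that added the INTEL_FAM6_SKYLAKE_X macro set an incorrect crystal frequency for Skylake‑X CPUs, leading to a systematic clock slowdown. NTP adjustments then periodically moved the lagging wall clock forward to catch up. Because Redis timed commands with gettimeofday(), a command in flight during such an adjustment had the jump counted as execution time, which produced the observed 1800 ms slowlog entries.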
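
The numbers are consistent with each other, as a quick calculation shows. The sketch below assumes the loss rate stated earlier (about 1 s every 10 minutes) is constant; `lag_ms_after` and `seconds_to_accumulate` are hypothetical helper names.

```python
# Loss rate stated in the article: about 1 s lost every 10 minutes.
LOSS_S = 1.0
PER_S = 600.0

drift_rate = LOSS_S / PER_S  # seconds lost per second of real time


def lag_ms_after(elapsed_s):
    """Accumulated lag in milliseconds after elapsed_s seconds of real time."""
    return drift_rate * elapsed_s * 1000.0


def seconds_to_accumulate(lag_ms):
    """Real time in seconds needed to fall behind by lag_ms milliseconds."""
    return (lag_ms / 1000.0) / drift_rate


print(f"drift rate: {drift_rate * 1e6:.0f} ppm")                            # 1667 ppm
print(f"lag after 20 min: {lag_ms_after(20 * 60):.0f} ms")                  # 2000 ms
print(f"time to lose 1813 ms: {seconds_to_accumulate(1813) / 60:.1f} min")  # 18.1 min
```

At this rate a lag of about 1813 ms builds up in roughly 18 minutes, which lines up with the ~20 minute timer test.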

Recommendations

Use proper timing APIs

For wall‑clock timestamps, prefer clock_gettime(CLOCK_REALTIME) over gettimeofday(), which POSIX.1‑2008 marks as obsolescent.

For measuring elapsed time, use clock_gettime(CLOCK_MONOTONIC) (or System.nanoTime() in Java) to avoid NTP‑induced jumps.

Upgrading the host kernel to version 4.14 or later eliminates the TSC bug, restoring accurate timing for Redis and other latency‑sensitive services.
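
The elapsed-time recommendation can be illustrated with a small sketch. It uses Python's `time.monotonic_ns()`, which on Linux is backed by CLOCK_MONOTONIC; the `stopwatch` helper is a hypothetical name introduced for this example. The wall clock is still the right choice when the goal is a timestamp rather than a duration.

```python
import time
from contextlib import contextmanager


@contextmanager
def stopwatch(clock=time.monotonic_ns):
    """Yield a dict whose 'elapsed_ms' entry is filled in when the block exits."""
    result = {}
    start = clock()
    try:
        yield result
    finally:
        result["elapsed_ms"] = (clock() - start) / 1_000_000


# Duration: measured on the monotonic clock, unaffected by wall-clock steps.
with stopwatch() as timing:
    sum(range(100_000))
print(f"elapsed: {timing['elapsed_ms']:.3f} ms")

# Timestamp: taken from the wall clock, suitable for logging when something happened.
logged_at = time.time()
print(f"logged at: {logged_at:.3f} (seconds since the epoch)")
```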
