Investigation of Sentinel QPS Mis‑reporting and Flow‑control Timing Bug

This article investigates why Sentinel’s QPS threshold of 20 is triggered at 15, analyzing metric logs, the underlying MetricNode and MetricTimerListener code, and revealing a timing discrepancy between second‑level and minute‑level counters that can cause premature flow‑control blocking.

Manbang Technology Team
Manbang Technology Team
Manbang Technology Team
Investigation of Sentinel QPS Mis‑reporting and Flow‑control Timing Bug

Problem Statement

The author encountered an issue where Sentinel’s console was configured with a QPS threshold of 20, but the metric logs showed that flow control was triggered at a pass QPS of 15. The relevant metric log snippet is shown below.

The log format includes timestamp, resource name, pass QPS, block QPS, complete QPS, error QPS, response time, etc. For example:

1568253804000|2019-09-12 10:03:24|sop|5|0|5|0|0|0|0|0|3

At second 24 the pass count is 5 with no blocks, while at second 25 the pass count jumps to 15 and block count becomes 2, causing the unexpected throttling.

Root Cause Exploration

The investigation starts with the data source of the metrics. The static block of com.alibaba.csp.sentinel.slots.block.flow.FlowRuleManager schedules a MetricTimerListener to run every second:

static {<br/>    currentProperty.addListener(LISTENER);<br/>    SCHEDULER.scheduleAtFixedRate(new MetricTimerListener(), 0, 1, TimeUnit.SECONDS);<br/>}

The listener aggregates MetricNode data from all resource nodes and writes them to the metric log file.

@Override<br/>public void run() {<br/>    Map<Long, List<MetricNode>> maps = new TreeMap<>();<br/>    for (Entry<ResourceWrapper, ClusterNode> e : ClusterBuilderSlot.getClusterNodeMap().entrySet()) {<br/>        ClusterNode node = e.getValue();<br/>        Map<Long, MetricNode> metrics = node.metrics();<br/>        aggregate(maps, metrics, node);<br/>    }<br/>    aggregate(maps, Constants.ENTRY_NODE.metrics(), Constants.ENTRY_NODE);<br/>    if (!maps.isEmpty()) {<br/>        for (Entry<Long, List<MetricNode>> entry : maps.entrySet()) {<br/>            try {<br/>                metricWriter.write(entry.getKey(), entry.getValue());<br/>            } catch (Exception e) {<br/>                RecordLog.warn("[MetricTimerListener] Write metric error", e);<br/>            }<br/>        }<br/>    }<br/>}

The MetricNode objects are obtained from a minute‑level rolling counter ( rollingCounterInMinute), which is a time‑windowed data structure:

public Map<Long, MetricNode> metrics() {<br/>    long currentTime = TimeUtil.currentTimeMillis();<br/>    currentTime = currentTime - currentTime % 1000;<br/>    Map<Long, MetricNode> metrics = new ConcurrentHashMap<>();<br/>    List<MetricNode> nodesOfEverySecond = rollingCounterInMinute.details();<br/>    for (MetricNode node : nodesOfEverySecond) {<br/>        if (isNodeInTime(node, currentTime) && isValidMetricNode(node)) {<br/>            metrics.put(node.getTimestamp(), node);<br/>        }<br/>    }<br/>    return metrics;<br/>}

Both second‑level ( rollingCounterInSecond) and minute‑level counters receive the same request count via addPassRequest:

@Override<br/>public void addPassRequest(int count) {<br/>    rollingCounterInSecond.addPass(count);<br/>    rollingCounterInMinute.addPass(count);<br/>}

Because the two counters determine their time windows independently using the current system time, a high‑concurrency request that spans the boundary between two seconds may be recorded in different windows for the second‑level and minute‑level counters. This leads to the minute‑level statistics (which drive the dashboard and the log file) showing a lower pass QPS while the second‑level counter already considers the next second, causing premature flow‑control.

Community Confirmation

Similar issues have been reported on the Sentinel GitHub repository (issues #1041 and #1039). The core maintainer explained that the second‑level and minute‑level multidimensional statistics are currently based on independent timestamps, which can cause the last‑second blocked requests to be counted in the next second.

Conclusion

The root cause is the inconsistency between second‑level and minute‑level statistics. To fix the bug, both counters should share the same timestamp when aggregating data, ensuring that metrics are counted in the correct time window.

In short, the discrepancy between the two time windows leads to the observed premature throttling.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

JavaMetricssentinelFlow ControlQPSBug Analysis
Manbang Technology Team
Written by

Manbang Technology Team

Manbang Technology Team

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.