Investigation of Sentinel QPS Mis‑reporting and Flow‑control Timing Bug
This article investigates why Sentinel’s QPS threshold of 20 is triggered at 15, analyzing metric logs, the underlying MetricNode and MetricTimerListener code, and revealing a timing discrepancy between second‑level and minute‑level counters that can cause premature flow‑control blocking.
Problem Statement
The author encountered an issue where Sentinel’s console was configured with a QPS threshold of 20, but the metric logs showed that flow control was triggered at a pass QPS of 15. The relevant metric log snippet is shown below.
The log format includes timestamp, resource name, pass QPS, block QPS, complete QPS, error QPS, response time, etc. For example:
1568253804000|2019-09-12 10:03:24|sop|5|0|5|0|0|0|0|0|3

At second 24 the pass count is 5 with no blocks, while at second 25 the pass count jumps to 15 and the block count becomes 2, causing the unexpected throttling.
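As a rough illustration of how one pipe-delimited metric line decomposes, the sketch below parses the fields named in the format description above. The field order past the resource name is inferred from the snippet, not taken from the Sentinel source, so treat the accessor names as assumptions.

```java
// Hypothetical parser for one Sentinel metric log line.
// Field positions beyond timestamp/date/resource are inferred from the
// format description (pass QPS, block QPS, ...), not from Sentinel's code.
public class MetricLineParser {

    public static long passQps(String line) {
        String[] fields = line.split("\\|");
        // fields[0] = epoch millis, fields[1] = human-readable time,
        // fields[2] = resource name, fields[3] = pass QPS, fields[4] = block QPS
        return Long.parseLong(fields[3]);
    }

    public static long blockQps(String line) {
        return Long.parseLong(line.split("\\|")[4]);
    }

    public static void main(String[] args) {
        String line = "1568253804000|2019-09-12 10:03:24|sop|5|0|5|0|0|0|0|0|3";
        System.out.println("pass=" + passQps(line) + " block=" + blockQps(line));
        // prints: pass=5 block=0
    }
}
```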
Root Cause Exploration
The investigation starts with the data source of the metrics. The static block of com.alibaba.csp.sentinel.slots.block.flow.FlowRuleManager schedules a MetricTimerListener to run every second:
static {
    currentProperty.addListener(LISTENER);
    SCHEDULER.scheduleAtFixedRate(new MetricTimerListener(), 0, 1, TimeUnit.SECONDS);
}

The listener aggregates MetricNode data from all resource nodes and writes it to the metric log file:
@Override
public void run() {
    Map<Long, List<MetricNode>> maps = new TreeMap<>();
    for (Entry<ResourceWrapper, ClusterNode> e : ClusterBuilderSlot.getClusterNodeMap().entrySet()) {
        ClusterNode node = e.getValue();
        Map<Long, MetricNode> metrics = node.metrics();
        aggregate(maps, metrics, node);
    }
    aggregate(maps, Constants.ENTRY_NODE.metrics(), Constants.ENTRY_NODE);
    if (!maps.isEmpty()) {
        for (Entry<Long, List<MetricNode>> entry : maps.entrySet()) {
            try {
                metricWriter.write(entry.getKey(), entry.getValue());
            } catch (Exception e) {
                RecordLog.warn("[MetricTimerListener] Write metric error", e);
            }
        }
    }
}

The MetricNode objects come from the minute-level rolling counter (rollingCounterInMinute), a time-windowed data structure:
public Map<Long, MetricNode> metrics() {
    long currentTime = TimeUtil.currentTimeMillis();
    currentTime = currentTime - currentTime % 1000;
    Map<Long, MetricNode> metrics = new ConcurrentHashMap<>();
    List<MetricNode> nodesOfEverySecond = rollingCounterInMinute.details();
    for (MetricNode node : nodesOfEverySecond) {
        if (isNodeInTime(node, currentTime) && isValidMetricNode(node)) {
            metrics.put(node.getTimestamp(), node);
        }
    }
    return metrics;
}

Both the second-level counter (rollingCounterInSecond) and the minute-level counter receive the same request count via addPassRequest:
@Override
public void addPassRequest(int count) {
    rollingCounterInSecond.addPass(count);
    rollingCounterInMinute.addPass(count);
}

Because each counter determines its time window independently, using its own reading of the current system time, a request that arrives under high concurrency near the boundary between two seconds can be recorded in different windows by the second-level and minute-level counters. As a result, the minute-level statistics (which drive the dashboard and the metric log) can show a lower pass QPS for a given second, while the second-level counter that actually drives the flow-control decision has already attributed those requests, so blocking kicks in before the logged QPS reaches the configured threshold of 20.
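The failure mode can be sketched with plain window arithmetic. This is a simplified model for illustration, not Sentinel's actual LeapArray implementation: each counter derives its bucket from its own clock read, so two reads that straddle a second boundary land in different windows.

```java
// Simplified model of two counters that each read the clock independently.
// Real Sentinel uses LeapArray buckets; this only shows the boundary race.
public class WindowMismatchDemo {

    // Map a millisecond timestamp to the start of its 1-second window.
    static long windowStart(long timeMillis) {
        return timeMillis - timeMillis % 1000;
    }

    public static void main(String[] args) {
        // Suppose one request is being recorded right at a second boundary:
        long firstClockRead  = 1568253804999L; // one counter reads ...24.999
        long secondClockRead = 1568253805001L; // the other reads  ...25.001
        System.out.println(windowStart(firstClockRead));  // 1568253804000 -> second 24
        System.out.println(windowStart(secondClockRead)); // 1568253805000 -> second 25
    }
}
```

The two-millisecond gap between the clock reads is enough for the same request to be attributed to second 24 in one counter and second 25 in the other.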
Community Confirmation
Similar issues have been reported on the Sentinel GitHub repository (issues #1041 and #1039). A core maintainer confirmed that the second-level and minute-level statistics are currently based on independent timestamps, which can cause requests from the last second to be counted in the next second.
Conclusion
The root cause is the inconsistency between second‑level and minute‑level statistics. To fix the bug, both counters should share the same timestamp when aggregating data, ensuring that metrics are counted in the correct time window.
In short, the discrepancy between the two time windows leads to the observed premature throttling.
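A minimal sketch of that direction is shown below, with hypothetical interfaces rather than Sentinel's real Metric API: the clock is read once per request and the same timestamp is handed to both counters, so a single request can never straddle the boundary.

```java
// Sketch of timestamp sharing between two counters (hypothetical classes,
// not Sentinel's actual LeapArray/Metric types).
import java.util.Map;
import java.util.TreeMap;

public class SharedTimestampCounters {

    static class WindowCounter {
        final Map<Long, Integer> buckets = new TreeMap<>();
        final long windowLengthMillis;

        WindowCounter(long windowLengthMillis) {
            this.windowLengthMillis = windowLengthMillis;
        }

        // Attribute the count using a caller-supplied timestamp instead of
        // reading the clock internally.
        void addPass(long timeMillis, int count) {
            long windowStart = timeMillis - timeMillis % windowLengthMillis;
            buckets.merge(windowStart, count, Integer::sum);
        }
    }

    final WindowCounter rollingCounterInSecond = new WindowCounter(1000);
    final WindowCounter rollingCounterInMinute = new WindowCounter(60_000);

    // Read the clock once so both counters agree on the time window.
    public void addPassRequest(int count) {
        long now = System.currentTimeMillis();
        rollingCounterInSecond.addPass(now, count);
        rollingCounterInMinute.addPass(now, count);
    }
}
```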
Manbang Technology Team