Backend Development 10 min read

Investigation of Sentinel QPS Mis‑reporting and Flow‑control Timing Bug

This article investigates why Sentinel’s QPS threshold of 20 is triggered at 15, analyzing metric logs, the underlying MetricNode and MetricTimerListener code, and revealing a timing discrepancy between second‑level and minute‑level counters that can cause premature flow‑control blocking.

Manbang Technology Team

Jun 10, 2021

Investigation of Sentinel QPS Mis‑reporting and Flow‑control Timing Bug

Problem Statement

The author encountered an issue where Sentinel’s console was configured with a QPS threshold of 20, but the metric logs showed that flow control was triggered at a pass QPS of 15. The relevant metric log snippet is shown below.

The log format includes timestamp, resource name, pass QPS, block QPS, complete QPS, error QPS, response time, etc. For example:

1568253804000|2019-09-12 10:03:24|sop|5|0|5|0|0|0|0|0|3

At second 24 the pass count is 5 with no blocks, while at second 25 the pass count jumps to 15 and block count becomes 2, causing the unexpected throttling.

Root Cause Exploration

The investigation starts with the data source of the metrics. The static block of com.alibaba.csp.sentinel.slots.block.flow.FlowRuleManager schedules a MetricTimerListener to run every second:

static {<br/>    currentProperty.addListener(LISTENER);<br/>    SCHEDULER.scheduleAtFixedRate(new MetricTimerListener(), 0, 1, TimeUnit.SECONDS);<br/>}

The listener aggregates MetricNode data from all resource nodes and writes them to the metric log file.

@Override<br/>public void run() {<br/>    Map<Long, List<MetricNode>> maps = new TreeMap<>();<br/>    for (Entry<ResourceWrapper, ClusterNode> e : ClusterBuilderSlot.getClusterNodeMap().entrySet()) {<br/>        ClusterNode node = e.getValue();<br/>        Map<Long, MetricNode> metrics = node.metrics();<br/>        aggregate(maps, metrics, node);<br/>    }<br/>    aggregate(maps, Constants.ENTRY_NODE.metrics(), Constants.ENTRY_NODE);<br/>    if (!maps.isEmpty()) {<br/>        for (Entry<Long, List<MetricNode>> entry : maps.entrySet()) {<br/>            try {<br/>                metricWriter.write(entry.getKey(), entry.getValue());<br/>            } catch (Exception e) {<br/>                RecordLog.warn("[MetricTimerListener] Write metric error", e);<br/>            }<br/>        }<br/>    }<br/>}

The MetricNode objects are obtained from a minute‑level rolling counter ( rollingCounterInMinute), which is a time‑windowed data structure:

public Map<Long, MetricNode> metrics() {<br/>    long currentTime = TimeUtil.currentTimeMillis();<br/>    currentTime = currentTime - currentTime % 1000;<br/>    Map<Long, MetricNode> metrics = new ConcurrentHashMap<>();<br/>    List<MetricNode> nodesOfEverySecond = rollingCounterInMinute.details();<br/>    for (MetricNode node : nodesOfEverySecond) {<br/>        if (isNodeInTime(node, currentTime) && isValidMetricNode(node)) {<br/>            metrics.put(node.getTimestamp(), node);<br/>        }<br/>    }<br/>    return metrics;<br/>}

Both second‑level ( rollingCounterInSecond) and minute‑level counters receive the same request count via addPassRequest:

@Override<br/>public void addPassRequest(int count) {<br/>    rollingCounterInSecond.addPass(count);<br/>    rollingCounterInMinute.addPass(count);<br/>}

Because the two counters determine their time windows independently using the current system time, a high‑concurrency request that spans the boundary between two seconds may be recorded in different windows for the second‑level and minute‑level counters. This leads to the minute‑level statistics (which drive the dashboard and the log file) showing a lower pass QPS while the second‑level counter already considers the next second, causing premature flow‑control.

Community Confirmation

Similar issues have been reported on the Sentinel GitHub repository (issues #1041 and #1039). The core maintainer explained that the second‑level and minute‑level multidimensional statistics are currently based on independent timestamps, which can cause the last‑second blocked requests to be counted in the next second.

Conclusion

The root cause is the inconsistency between second‑level and minute‑level statistics. To fix the bug, both counters should share the same timestamp when aggregating data, ensuring that metrics are counted in the correct time window.

In short, the discrepancy between the two time windows leads to the observed premature throttling.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Java Metrics Sentinel Flow Control QPS Bug Analysis

Written by

Manbang Technology Team

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.