Mastering Rate Limiting in Spring Cloud Gateway: Algorithms, Implementations, and Pitfalls

This article explores the evolution of Spring Cloud Gateway, explains common rate‑limiting scenarios and algorithms such as fixed‑window, sliding‑window, leaky‑bucket and token‑bucket, reviews open‑source limiters like Guava RateLimiter, Bucket4j and Resilience4j, and provides detailed guidance for implementing both single‑node and distributed rate‑limiting and concurrency‑limiting solutions within the gateway.

Java Interview Crash Guide

Spring Cloud Gateway Overview

Before Spring Cloud Gateway, Netflix Zuul was the default gateway in Spring Cloud, but Zuul 1.x suffered from blocking APIs and lack of WebSocket support. Spring Cloud Gateway, introduced in the Finchley release (June 2018), is built on Spring Framework 5, Spring Boot 2.0 and Project Reactor, offering reactive, non‑blocking APIs and WebSocket support.

Its main features are:

Built on Spring Framework 5, Project Reactor and Spring Boot 2.0

Route matching on any request attribute

Predicates and filters specific to routes

Hystrix circuit‑breaker integration

Spring Cloud DiscoveryClient integration

Easy to write predicates and filters

Request rate limiting

Path rewriting

Gateway integrates seamlessly with other Spring Cloud components, providing a simple Predicate and Filter mechanism for per‑route request handling.

Common Rate‑Limiting Scenarios

Rate limiting controls request rates to improve system resilience during traffic spikes. Typical scenarios include:

Limit an API to 100 requests per minute

Limit a user’s download speed to 100 KB/s

Allow a maximum of 5 concurrent requests per user

Block all requests from a specific IP

These scenarios fall into two main kinds of limiting: request‑frequency limiting and concurrent‑request limiting. The former restricts the number of calls per unit of time, while the latter caps the number of requests executing at the same moment.
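The concurrency side can be sketched with a plain java.util.concurrent.Semaphore per user. This is a minimal illustration of the "5 concurrent requests per user" scenario, not code from the gateway; the class name PerUserConcurrencyLimiter and its methods are hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

// Hypothetical helper: caps the number of in-flight requests per user.
class PerUserConcurrencyLimiter {
    private final int maxConcurrent;
    private final ConcurrentHashMap<String, Semaphore> permits = new ConcurrentHashMap<>();

    PerUserConcurrencyLimiter(int maxConcurrent) {
        this.maxConcurrent = maxConcurrent;
    }

    // Returns true if the user may start another request right now (never blocks).
    boolean tryEnter(String userId) {
        return permits
            .computeIfAbsent(userId, id -> new Semaphore(maxConcurrent))
            .tryAcquire();
    }

    // Call once for every successful tryEnter, typically in a finally block.
    void exit(String userId) {
        Semaphore s = permits.get(userId);
        if (s != null) {
            s.release();
        }
    }
}
```

Unlike a frequency limiter, nothing here depends on time: a permit comes back only when a request finishes. The map in this sketch grows with the number of distinct users, so a real deployment would need some form of eviction.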

Handling Strategies

When a request exceeds the limit, three strategies are common:

Reject the request (e.g., HTTP 429)

Queue the request for later processing

Provide a fallback response (service degradation)
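The three strategies can be sketched as a small dispatcher that runs only after a limiter has refused a request. Everything here is illustrative: the OverLimitHandler class, the Strategy enum and the response strings are assumptions, and a real gateway filter would write an HTTP response instead of returning a string.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical dispatcher for requests that a limiter has already refused.
class OverLimitHandler {
    enum Strategy { REJECT, QUEUE, FALLBACK }

    private final BlockingQueue<Runnable> backlog;

    OverLimitHandler(int backlogSize) {
        this.backlog = new ArrayBlockingQueue<>(backlogSize);
    }

    String handle(Strategy strategy, Runnable request) {
        switch (strategy) {
            case QUEUE:
                // offer() never blocks; a full backlog degrades to rejection.
                return backlog.offer(request) ? "202 Accepted" : "429 Too Many Requests";
            case FALLBACK:
                // Service degradation: answer with a cached or default payload.
                return "200 OK (fallback)";
            default:
                return "429 Too Many Requests";
        }
    }

    // Number of requests waiting for later processing.
    int queued() {
        return backlog.size();
    }
}
```

Bounding the backlog matters: an unbounded queue only moves the overload from the backend into the gateway's memory.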

Architectural Choices

Rate limiting can be applied at the gateway layer (centralized) or at the middleware layer (distributed). In single‑node deployments, an in‑memory limiter suffices; in clustered environments, a shared component such as Redis, Hazelcast, or other distributed caches is required.

Popular Limiting Algorithms

Fixed Window

The fixed‑window algorithm counts requests within a discrete time bucket (e.g., one minute). When the counter exceeds the threshold, further requests are blocked until the next bucket.

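public class FixedWindowLimiter {
    private final long limit;                  // max requests per window
    private final long windowSizeMs = 60_000;  // 1 minute
    private long windowStart = System.currentTimeMillis();
    private long counter = 0;

    public FixedWindowLimiter(long limit) {
        this.limit = limit;
    }

    // synchronized so that the window reset and the increment happen as one step
    public synchronized boolean tryAcquire() {
        long now = System.currentTimeMillis();
        if (now - windowStart >= windowSizeMs) {
            windowStart = now;  // start a new window and reset the count
            counter = 0;
        }
        if (counter >= limit) return false;
        counter++;
        return true;
    }
}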

The main drawback is the “boundary problem”: a burst at the end of one window and the start of the next can double the effective rate.
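The boundary problem is easy to reproduce once the clock is under the caller's control. The sketch below is a fixed-window counter that takes timestamps as arguments (an assumption made purely so the demonstration is deterministic); the class name BoundaryBurstDemo is hypothetical.

```java
// Fixed-window counter driven by caller-supplied timestamps, so the
// boundary effect can be reproduced without sleeping.
class BoundaryBurstDemo {
    private final int limit;
    private final long windowSizeMs;
    private long windowStart = 0;
    private int count = 0;

    BoundaryBurstDemo(int limit, long windowSizeMs) {
        this.limit = limit;
        this.windowSizeMs = windowSizeMs;
    }

    boolean tryAcquire(long nowMs) {
        if (nowMs - windowStart >= windowSizeMs) {
            windowStart = nowMs;  // a new window begins, the counter resets
            count = 0;
        }
        if (count >= limit) return false;
        count++;
        return true;
    }

    // Issues `attempts` requests at the same instant and counts the admitted ones.
    static int admitted(BoundaryBurstDemo limiter, long nowMs, int attempts) {
        int ok = 0;
        for (int i = 0; i < attempts; i++) {
            if (limiter.tryAcquire(nowMs)) ok++;
        }
        return ok;
    }

    public static void main(String[] args) {
        BoundaryBurstDemo limiter = new BoundaryBurstDemo(100, 60_000);
        int late = admitted(limiter, 59_900, 150);   // just before the window ends
        int early = admitted(limiter, 60_100, 150);  // just after the next one starts
        System.out.println(late + early);            // prints 200
    }
}
```

With a limit of 100 per minute, 200 requests are admitted within 200 ms, which is the doubling described above.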

Sliding Window

Sliding‑window divides the main interval into smaller sub‑windows and slides them over time, summing counts from all sub‑windows for a smoother limit.

import java.util.Arrays;

public class SlidingWindowLimiter {
    private final int subWindowCount = 5;       // e.g., 5 sub‑windows of 1 s each
    private final long subWindowSizeMs = 1000;
    private final long[] counters = new long[subWindowCount];
    private final long limit;                   // max requests across all sub‑windows
    private long lastTick = System.currentTimeMillis();

    public SlidingWindowLimiter(long limit) {
        this.limit = limit;
    }

    public synchronized boolean tryAcquire() {
        long now = System.currentTimeMillis();
        long elapsedWindows = (now - lastTick) / subWindowSizeMs;
        if (elapsedWindows > 0) {
            int shift = (int) Math.min(elapsedWindows, subWindowCount);
            // Slide left by `shift` slots and clear the newly exposed slots.
            System.arraycopy(counters, shift, counters, 0, subWindowCount - shift);
            Arrays.fill(counters, subWindowCount - shift, subWindowCount, 0);
            // Advance by whole sub‑windows only, so partial elapsed time is not lost.
            lastTick += elapsedWindows * subWindowSizeMs;
        }
        long total = 0;
        for (long c : counters) total += c;
        if (total >= limit) return false;
        counters[subWindowCount - 1]++;
        return true;
    }
}
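For comparison, here is a different variant of the same idea, usually called a sliding-window log: instead of sub-window counters it keeps the timestamp of every admitted request. It is not the implementation above, just a sketch with a hypothetical class name and caller-supplied timestamps, and it trades memory (one entry per admitted request) for an exact window.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sliding-window log: remembers when each admitted request arrived.
class SlidingLogLimiter {
    private final int limit;
    private final long windowMs;
    private final Deque<Long> admittedAt = new ArrayDeque<>();

    SlidingLogLimiter(int limit, long windowMs) {
        this.limit = limit;
        this.windowMs = windowMs;
    }

    synchronized boolean tryAcquire(long nowMs) {
        // Drop entries that are no longer inside (nowMs - windowMs, nowMs].
        while (!admittedAt.isEmpty() && admittedAt.peekFirst() <= nowMs - windowMs) {
            admittedAt.pollFirst();
        }
        if (admittedAt.size() >= limit) return false;
        admittedAt.addLast(nowMs);
        return true;
    }
}
```

Replaying the boundary burst against this limiter (100 requests at 59,900 ms, then more at 60,100 ms) admits none of the second batch, because the first batch is still inside the trailing 60-second window.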

Leaky Bucket

Requests are queued in a bucket that drains at a constant rate. Excess requests are dropped when the bucket is full, smoothing bursty traffic.

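public class LeakyBucket {
    private final long capacity;   // max requests the bucket can hold
    private final long leakRate;   // requests drained per second
    private long water = 0;        // current fill level
    private long lastLeakTimestamp = System.nanoTime();

    public LeakyBucket(long capacity, long leakRate) {
        this.capacity = capacity;
        this.leakRate = leakRate;
    }

    public synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        long leaked = ((now - lastLeakTimestamp) * leakRate) / 1_000_000_000L;
        if (leaked > 0) {
            water = Math.max(0, water - leaked);
            // Advance the timestamp only when something leaked; otherwise frequent
            // calls would keep rounding the leak down to zero and never drain.
            lastLeakTimestamp = now;
        }
        if (water < capacity) {
            water++;
            return true;
        }
        return false;
    }
}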

Token Bucket

Tokens are added to a bucket at a fixed rate; each request consumes a token. The bucket size determines burst capacity.

public class TokenBucket {
    private final long capacity;
    private final double refillRatePerMs;
    private double tokens;
    private long lastRefillTimestamp;
    public TokenBucket(long capacity, long refillTokens, long refillPeriodMs) {
        this.capacity = capacity;
        this.refillRatePerMs = (double) refillTokens / refillPeriodMs;
        this.tokens = capacity;
        this.lastRefillTimestamp = System.currentTimeMillis();
    }
    public synchronized boolean tryConsume(int n) {
        refill();
        if (tokens < n) return false;
        tokens -= n;
        return true;
    }
    private void refill() {
        long now = System.currentTimeMillis();
        long delta = now - lastRefillTimestamp;
        if (delta > 0) {
            tokens = Math.min(capacity, tokens + delta * refillRatePerMs);
            lastRefillTimestamp = now;
        }
    }
}
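To see how the bucket size sets burst capacity, the sketch below replays a burst against a token bucket whose timestamps are passed in by the caller. It follows the same refill arithmetic as the class above but is a separate, hypothetical class (BurstyTokenBucket) written only for this demonstration.

```java
// Token bucket with caller-supplied timestamps, for deterministic experiments.
class BurstyTokenBucket {
    private final long capacity;
    private final double refillPerMs;
    private double tokens;
    private long lastRefillMs;

    BurstyTokenBucket(long capacity, long refillTokens, long refillPeriodMs, long startMs) {
        this.capacity = capacity;
        this.refillPerMs = (double) refillTokens / refillPeriodMs;
        this.tokens = capacity;  // start full, so an initial burst is allowed
        this.lastRefillMs = startMs;
    }

    synchronized boolean tryConsume(long nowMs) {
        long delta = nowMs - lastRefillMs;
        if (delta > 0) {
            tokens = Math.min(capacity, tokens + delta * refillPerMs);
            lastRefillMs = nowMs;
        }
        if (tokens < 1) return false;
        tokens -= 1;
        return true;
    }

    // Issues `attempts` requests at the same instant and counts the admitted ones.
    static int admitted(BurstyTokenBucket bucket, long nowMs, int attempts) {
        int ok = 0;
        for (int i = 0; i < attempts; i++) {
            if (bucket.tryConsume(nowMs)) ok++;
        }
        return ok;
    }
}
```

With a capacity of 10 and a refill of 5 tokens per second, a burst of 20 requests at time 0 gets 10 through; 1.1 seconds later another 5 are admitted; after a long idle period the bucket is full again but never holds more than 10.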

Open‑Source Rate Limiters

Guava RateLimiter

Guava implements a token‑bucket limiter with two modes: SmoothBursty and SmoothWarmingUp. Example:

RateLimiter limiter = RateLimiter.create(5); // 5 permits per second
limiter.acquire(); // blocks until a permit is available

Bucket4j

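Bucket4j provides a token‑bucket implementation with support for distributed caches (Hazelcast, Ignite, etc.). Core concepts are Bucket, Bandwidth and Refill. The example below uses the Bucket4j.builder() entry point of older releases; recent releases expose Bucket.builder() instead.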

Bucket bucket = Bucket4j.builder()
    .addLimit(Bandwidth.classic(10, Refill.greedy(5, Duration.ofMinutes(1))))
    .build();
if (bucket.tryConsume(1)) {
    // allowed
}

Resilience4j

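Resilience4j offers two RateLimiter implementations, SemaphoreBasedRateLimiter and AtomicRateLimiter. Both grant a fixed number of permits (limitForPeriod) in each refresh cycle (limitRefreshPeriod). It also provides Bulkhead for concurrency limiting.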

BulkheadConfig bulkConfig = BulkheadConfig.custom()
    .maxConcurrentCalls(150)
    .maxWaitDuration(Duration.ofMillis(100)) // how long a caller may wait for a free slot
    .build();
Bulkhead bulkhead = Bulkhead.of("backend", bulkConfig);
RateLimiterConfig rlConfig = RateLimiterConfig.custom()
    .limitForPeriod(1)
    .limitRefreshPeriod(Duration.ofSeconds(1))
    .timeoutDuration(Duration.ofMillis(100))
    .build();
RateLimiter rl = RateLimiter.of("backend", rlConfig);

Implementing Rate Limiting in Spring Cloud Gateway

Single‑Node Request‑Frequency Limiting

Gateway defines a RateLimiter interface with a single method isAllowed(routeId, id). By providing a custom KeyResolver (e.g., based on client IP) and a local limiter implementation, you can enforce per‑client limits without Redis.

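public interface KeyResolver {
    Mono<String> resolve(ServerWebExchange exchange);
}
public class HostAddrKeyResolver implements KeyResolver {
    @Override
    public Mono<String> resolve(ServerWebExchange exchange) {
        // getRemoteAddress() can be null, so use justOrEmpty instead of just;
        // an empty Mono means no key could be resolved for this request.
        return Mono.justOrEmpty(exchange.getRequest().getRemoteAddress())
            .map(addr -> addr.getAddress().getHostAddress());
    }
}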
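A local limiter to pair with such a key could look like the sketch below: one in-memory token bucket per route and key. It deliberately uses only the JDK, so it returns a plain boolean rather than implementing Gateway's reactive RateLimiter interface; the class name LocalKeyedLimiter, its constructor parameters and the injected clock are all assumptions.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

// Hypothetical in-memory limiter: one token bucket per (routeId, key) pair.
class LocalKeyedLimiter {
    private static final class Bucket {
        double tokens;
        long lastMs;
        Bucket(double tokens, long lastMs) {
            this.tokens = tokens;
            this.lastMs = lastMs;
        }
    }

    private final ConcurrentHashMap<String, Bucket> buckets = new ConcurrentHashMap<>();
    private final double replenishPerSecond;
    private final double burstCapacity;
    private final LongSupplier clockMs;  // injected so the limiter can be tested

    LocalKeyedLimiter(double replenishPerSecond, double burstCapacity, LongSupplier clockMs) {
        this.replenishPerSecond = replenishPerSecond;
        this.burstCapacity = burstCapacity;
        this.clockMs = clockMs;
    }

    boolean isAllowed(String routeId, String id) {
        long now = clockMs.getAsLong();
        Bucket b = buckets.computeIfAbsent(routeId + ":" + id,
            k -> new Bucket(burstCapacity, now));
        synchronized (b) {
            long elapsedMs = Math.max(0, now - b.lastMs);
            double refilled = Math.min(burstCapacity,
                b.tokens + elapsedMs / 1000.0 * replenishPerSecond);
            b.lastMs = Math.max(b.lastMs, now);
            if (refilled < 1) {
                b.tokens = refilled;
                return false;
            }
            b.tokens = refilled - 1;
            return true;
        }
    }
}
```

In production the clock would be System::currentTimeMillis, and idle buckets would need to be evicted. Because the state lives in one JVM, the limit applies per gateway instance, which is exactly why clustered deployments need a shared store.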

The default RedisRateLimiter uses a Lua script (request_rate_limiter.lua) to perform atomic token‑bucket calculations in Redis.

local tokens_key = KEYS[1]
local timestamp_key = KEYS[2]
local rate = tonumber(ARGV[1])
local capacity = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local requested = tonumber(ARGV[4])
-- token bucket logic omitted for brevity
return {allowed, new_tokens}

Configuration Examples

YAML configuration for a route‑level limiter:

spring:
  cloud:
    gateway:
      routes:
      - id: test
        uri: http://httpbin.org:80
        filters:
        - name: RequestRateLimiter
          args:
            key-resolver: '#{@hostAddrKeyResolver}'
            redis-rate-limiter.replenishRate: 1
            redis-rate-limiter.burstCapacity: 3

Java‑based configuration using RedisRateLimiter:

@Bean
public RouteLocator myRoutes(RouteLocatorBuilder builder) {
    return builder.routes()
        .route(p -> p.path("/get")
            .filters(f -> f.requestRateLimiter()
                .rateLimiter(RedisRateLimiter.class, rl -> rl.setBurstCapacity(3).setReplenishRate(1)))
            .uri("http://httpbin.org:80"))
        .build();
}

Distributed Request‑Frequency Limiting

Spring Cloud Gateway’s built‑in RedisRateLimiter already provides a distributed token‑bucket implementation using the Lua script shown above. It supports per‑second granularity; sub‑second rates are not currently supported.
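The per-second granularity follows from the arithmetic if time is tracked in whole seconds. The sketch below is not a transcription of request_rate_limiter.lua; it is a plain-Java illustration of one token-bucket step that reuses the argument names from the excerpt (rate, capacity, now, requested). The state parameters lastTokens and lastRefreshedSec are assumptions standing in for whatever is stored under the two Redis keys.

```java
// One token-bucket step with time measured in whole seconds.
class TokenBucketStep {
    final boolean allowed;
    final long tokensLeft;

    TokenBucketStep(boolean allowed, long tokensLeft) {
        this.allowed = allowed;
        this.tokensLeft = tokensLeft;
    }

    static TokenBucketStep apply(long lastTokens, long lastRefreshedSec,
                                 long rate, long capacity,
                                 long nowSec, long requested) {
        // Requests arriving within the same second see delta == 0 and get no refill.
        long delta = Math.max(0, nowSec - lastRefreshedSec);
        long filled = Math.min(capacity, lastTokens + delta * rate);
        boolean allowed = filled >= requested;
        long left = allowed ? filled - requested : filled;
        return new TokenBucketStep(allowed, left);
    }
}
```

With rate 1 and capacity 3, a full bucket admits three requests in the same second and refuses the fourth; two seconds later two tokens are back. A rate below one request per second cannot be expressed with these integer inputs alone.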

Concurrent‑Request Limiting (Bulkhead)

Resilience4j’s Bulkhead can limit concurrent calls either via semaphores or a fixed thread pool. Example configuration:

BulkheadConfig bulkConfig = BulkheadConfig.custom()
    .maxConcurrentCalls(150)
    .maxWaitDuration(Duration.ofMillis(100)) // how long a caller may wait for a free slot
    .build();
Bulkhead bulkhead = Bulkhead.of("backend", bulkConfig);

Distributed Concurrency Limiting

Distributed semaphores can be built on Redis (Redisson RSemaphore) or Ignite (IgniteSemaphore). The idea is to store a counter with a TTL so that crashes automatically release permits.

Two practical approaches:

Assign each request a unique key with a short TTL; count active keys with MGET or SCAN to determine current concurrency.

Maintain per‑instance counters (e.g., instances_xxx) and aggregate them to obtain total concurrency.
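The first approach can be sketched with an in-memory map standing in for Redis: each in-flight request owns a key with an expiry, and the number of live keys is the current concurrency. The class name TtlKeyConcurrencyLimiter and the explicit timestamps are assumptions for illustration; in a real deployment the count and the write would have to be made atomic on the Redis side.

```java
import java.util.HashMap;
import java.util.Map;

// In-memory stand-in for "one key per request, with a TTL".
class TtlKeyConcurrencyLimiter {
    private final int maxConcurrent;
    private final long ttlMs;
    private final Map<String, Long> expiresAt = new HashMap<>();

    TtlKeyConcurrencyLimiter(int maxConcurrent, long ttlMs) {
        this.maxConcurrent = maxConcurrent;
        this.ttlMs = ttlMs;
    }

    synchronized boolean tryAcquire(String requestId, long nowMs) {
        purgeExpired(nowMs);
        if (expiresAt.size() >= maxConcurrent) return false;
        expiresAt.put(requestId, nowMs + ttlMs);
        return true;
    }

    // Normal completion: the request deletes its own key.
    synchronized void release(String requestId) {
        expiresAt.remove(requestId);
    }

    synchronized int active(long nowMs) {
        purgeExpired(nowMs);
        return expiresAt.size();
    }

    // Keys left behind by a crashed holder disappear once their TTL has passed.
    private void purgeExpired(long nowMs) {
        expiresAt.values().removeIf(deadline -> deadline <= nowMs);
    }
}
```

The TTL has to exceed the longest legitimate request duration; otherwise a slow request loses its key while still running and the limiter undercounts.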

Double‑Window Sliding Algorithm for Distributed Concurrency

This algorithm uses two Redis keys representing the current and previous minute windows (e.g., 202009051130). A background thread periodically migrates expired requests from the previous window to the current one, ensuring only two keys need to be read with MGET. The algorithm guarantees atomic updates and handles node crashes via key TTLs.

[Figure: double‑window sliding algorithm diagram]
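Deriving the two window keys is the one part of this scheme that is fully determined by the description. The sketch below computes the current and previous minute keys in the yyyyMMddHHmm form of the example key; the class name MinuteWindowKeys and the choice of UTC are assumptions, and the Redis calls themselves are left out.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

// Computes the two minute-window keys that would be read together.
class MinuteWindowKeys {
    private static final DateTimeFormatter FORMAT =
        DateTimeFormatter.ofPattern("yyyyMMddHHmm").withZone(ZoneOffset.UTC);

    // Key of the minute window that contains `now`.
    static String current(Instant now) {
        return FORMAT.format(now.truncatedTo(ChronoUnit.MINUTES));
    }

    // Key of the minute window immediately before the current one.
    static String previous(Instant now) {
        return FORMAT.format(
            now.truncatedTo(ChronoUnit.MINUTES).minus(1, ChronoUnit.MINUTES));
    }
}
```

For an instant of 2020-09-05T11:30:15Z this yields 202009051130 and 202009051129. All gateway instances must agree on the time zone and have reasonably synchronized clocks, or they will read and write different keys.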

Conclusion

Rate limiting is essential for protecting gateway stability. This article covered common scenarios, classic algorithms, open‑source libraries, and concrete implementations for both single‑node and distributed environments using Spring Cloud Gateway, Redis, Resilience4j, and Bucket4j. While the focus was on request‑frequency limiting, the same principles extend to concurrency limiting, and further topics such as Sentinel or advanced adaptive algorithms remain for future exploration.
