Backend Development 29 min read

Designing a Million‑QPS Rate Limiter for Backend System Interviews

This article walks through a complete, interview‑ready design of a high‑performance rate‑limiting system that can handle up to one million queries per second, covering requirements, core entities, algorithm choices, distributed state storage with Redis, scalability, high availability, latency optimization, hot‑key mitigation, and dynamic rule configuration.

Su San Talks Tech

Sep 18, 2025

Designing a Million‑QPS Rate Limiter for Backend System Interviews

1. What Is a Rate Limiter

A rate limiter is a mechanism that controls request traffic in high‑concurrency systems, preventing overload by limiting the number of requests a user or service can make within a given time window (e.g., 100 requests per minute).

2. Requirement Analysis

Interviewer: “Let’s discuss system design. How would you implement a high‑performance rate limiter?”

2.1 Functional Requirements

2.1.1 Core Requirements

The system must identify clients by user ID, IP address, or API key to apply appropriate limits.

Configurable rules should limit HTTP requests (e.g., 100 API calls per minute per user).

When the limit is exceeded, the system must reject the request with HTTP 429 and include useful headers such as remaining quota and reset time.

2.2 Non‑functional Requirements

Assume the target scale is 100 million daily active users generating 1 million QPS.

2.2.1 Core Non‑functional Requirements

Low latency : the rate‑limit check must add less than 10 ms overhead per request.

High availability : the limiter is a core infrastructure component and may tolerate eventual consistency.

High concurrency : the system must reliably handle 1 M QPS.

3. Low‑level Design

Define core entities and interfaces first.

3.1 Core Entity Definitions

Rule : defines a rate‑limit policy for a specific scenario (e.g., 1000 requests per hour for authenticated users).

Client : the entity being limited (user, IP, or API key) with an associated rate‑limit state.

Request : an incoming API call carrying context such as client identity, endpoint, and timestamp.

When a request arrives, the system identifies the client, looks up the applicable rule, checks the current usage, and decides to allow or reject.

3.2 System Interface Design

/**
 * Check whether a request is allowed
 * @param clientId unique client identifier (user ID, IP, or API key)
 * @param ruleId   unique rule identifier
 * @return object containing pass flag, remaining count and reset time
 */
isRequestAllowed(clientId, ruleId) -> {
    passes: boolean,      // whether allowed
    remaining: number,    // remaining requests
    resetTime: timestamp // counter reset timestamp
}

4. High‑level Architecture

4.1 Where to Deploy the Rate Limiter?

Interviewer: “Where would you place the limiter in the architecture?”

4.1.1 In‑process Limiting

Each service instance keeps its own counter in memory. This is fast but the state is not shared, leading to inaccurate global limits.

4.1.2 Dedicated Service Limiting

A separate microservice holds a centralized counter. It provides accurate limits but adds network latency and introduces a single‑point‑of‑failure risk.

4.1.3 API Gateway / Load Balancer Layer

Preferred solution: embed the limiter at the entry point (API gateway or load balancer) so every external request is checked before reaching any backend service.

4.2 How to Identify Clients?

Typical identifiers:

User ID : extracted from JWT or session token.

IP address : taken from X‑Forwarded‑For header.

API key : provided via X‑API‑Key header.

In practice, multiple identifiers can be combined to enforce layered rules.

4.3 Rate‑Limiting Algorithm Choice

4.3.1 Fixed Window Counter

Simple hash table mapping client‑window to a count. Easy to implement but suffers from boundary spikes.

{
  "alice:12:00:00": 100,
  "alice:12:01:00": 5,
  "bob:12:00:00": 20
}

4.3.2 Sliding Window Counter

Divides a large window into smaller sub‑windows to smooth spikes. Improves accuracy but still has granularity limits.

4.3.3 Leaky Bucket

Requests fill a bucket that leaks at a fixed rate; excess requests are dropped. Limited flexibility for dynamic traffic.

4.3.4 Token Bucket (Recommended)

Each client has a bucket with a capacity and a refill rate. Tokens are consumed per request; if none are available, the request is rejected. Balances burst handling and steady‑state rate control.

4.4 Storing Token‑Bucket State

Use Redis as a high‑performance, in‑memory store accessible by all gateway instances. Perform atomic read‑modify‑write using a Lua script to avoid race conditions.

MULTI
HSET user:bucket tokens <new_token_count>
HSET user:bucket last_refill <current_timestamp>
EXPIRE user:bucket 3600
EXEC

4.5 High Availability of Redis

Deploy a Redis Cluster with sharding (16384 hash slots) and master‑slave replication. Sentinel or built‑in cluster failover ensures continuity when a node crashes.

4.6 Minimizing Latency

Maintain a connection pool between the gateway and Redis to avoid per‑request TCP handshakes.

Co‑locate gateways and Redis nodes in the same data center or availability zone to keep network round‑trip time minimal.

4.7 Hot‑key Mitigation

Detect abusive clients that repeatedly trigger limits and automatically blacklist them, or provide higher‑tier packages with dedicated resources for legitimate high‑traffic users.

4.8 Dynamic Rule Configuration

Store rate‑limit rules in an external configuration center (e.g., Nacos, Apollo) and let gateways either poll (pull) the config at regular intervals or receive push notifications via a long‑lived connection. Pull is simple; push offers real‑time updates.

5. Summary

The article presents a systematic, interview‑ready design of a million‑QPS rate limiter, covering requirement gathering, entity modeling, algorithm selection, distributed state management with Redis, scalability through sharding, high availability, latency optimization, hot‑key handling, and dynamic configuration.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Distributed Systems Backend Architecture System Design high concurrency rate limiting Token Bucket

Written by

Su San Talks Tech

Su San, former staff at several leading tech companies, is a top creator on Juejin and a premium creator on CSDN, and runs the free coding practice site www.susan.net.cn.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.