Rate Limiting Algorithms

Your API receives 10,000 requests per second from a single client. Without rate limiting, that client’s traffic saturates your database connection pool, causing timeouts for every other user. Rate limiting caps how many requests a client can make in a given time period — protecting backend resources, enforcing fair usage, and preventing abuse.

The algorithm you choose determines the precision, memory cost, burst tolerance, and implementation complexity of your rate limiter.

Fixed Window Counter

Divide time into fixed-duration windows (e.g., 1-minute intervals). Maintain a counter per client per window. Increment on each request. Reject when the counter exceeds the limit.

Window:  [12:00:00 — 12:01:00]   [12:01:00 — 12:02:00]
Counter:       0 → 1 → ... → 100      0 → 1 → ...

Request at 12:00:45, counter=99 → ALLOW (counter → 100)
Request at 12:00:46, counter=100 → REJECT (limit reached)
Request at 12:01:00, counter resets → ALLOW (counter → 1)

Redis implementation:

def fixed_window(client_id, limit, window_seconds):
    key = f"rate:{client_id}:{int(time.time()) // window_seconds}"
    count = redis.incr(key)
    if count == 1:
        redis.expire(key, window_seconds)
    return count <= limit

The Boundary Burst Problem

A client can send limit requests at the end of one window and limit requests at the start of the next — effectively 2× the intended rate across a short period.

Limit: 100 requests per minute

Window 1: [12:00 — 12:01]  Window 2: [12:01 — 12:02]
           ............▓▓▓▓▓|▓▓▓▓▓............
           100 requests     | 100 requests
           in last 5 sec    | in first 5 sec

= 200 requests in a 10-second span — double the intended rate

This is acceptable for many use cases but dangerous for resource-sensitive operations like payment processing.

Sliding Window Log

Store the exact timestamp of every request in a sorted log. On each new request, remove timestamps older than the window, count remaining entries, and allow or reject.

sequenceDiagram
    participant C as Client
    participant RL as Rate Limiter
    participant Log as Timestamp Log (sorted set)

    C->>RL: Request at t=1000.5
    RL->>Log: Remove entries where timestamp < 1000.5 - 60
    RL->>Log: Count remaining entries
    Log->>RL: count = 97
    RL->>RL: 97 < 100 limit → ALLOW
    RL->>Log: Add timestamp 1000.5
    RL->>C: 200 OK

Redis sorted set implementation:

def sliding_window_log(client_id, limit, window_seconds):
    key = f"rate:{client_id}"
    now = time.time()
    window_start = now - window_seconds

    pipe = redis.pipeline()
    pipe.zremrangebyscore(key, 0, window_start)  # remove expired
    pipe.zadd(key, {str(now): now})               # add current
    pipe.zcard(key)                                # count
    pipe.expire(key, window_seconds)               # TTL cleanup
    _, _, count, _ = pipe.execute()

    return count <= limit

Precision: eliminates the boundary burst problem — the window slides with every request.

Cost: O(N) memory per client where N = number of requests in the window. For a high-volume API (10K requests/minute per client), this stores 10K timestamps per client. At scale, this is expensive.

Sliding Window Counter

An approximation that combines the simplicity of fixed windows with the smoothness of a sliding window. Maintain counters for the current and previous fixed windows. Weight the previous window’s counter by the overlap fraction.

Current window:  [12:01:00 — 12:02:00]  counter = 40
Previous window: [12:00:00 — 12:01:00]  counter = 85
Time now: 12:01:15 → 25% into current window → 75% overlap with previous

Estimated count = previous × (1 - elapsed%) + current
                = 85 × 0.75 + 40
                = 63.75 + 40 = 103.75 → REJECT (limit = 100)

def sliding_window_counter(client_id, limit, window_seconds):
    now = time.time()
    current_window = int(now) // window_seconds
    previous_window = current_window - 1
    elapsed = (now % window_seconds) / window_seconds

    current_count = int(redis.get(f"rate:{client_id}:{current_window}") or 0)
    previous_count = int(redis.get(f"rate:{client_id}:{previous_window}") or 0)

    estimated = previous_count * (1 - elapsed) + current_count
    if estimated >= limit:
        return False

    redis.incr(f"rate:{client_id}:{current_window}")
    redis.expire(f"rate:{client_id}:{current_window}", window_seconds * 2)
    return True

Memory: O(2) keys per client — constant regardless of request volume. This is the sweet spot for most production rate limiters.

Accuracy: an approximation, not exact. The error is small and bounded — Cloudflare uses this algorithm for their rate limiting layer.

Token Bucket

A bucket holds tokens up to a maximum capacity B. Tokens are added at a fixed rate R (e.g., 10 tokens/second). Each request consumes one token. If the bucket is empty, the request is rejected.

flowchart LR
    R[Token Refill
rate R = 10/sec] -->|drip| Bucket["Bucket
capacity B = 50
current: 35 tokens"]
    Bucket -->|consume 1| Allow[ALLOW request]
    Bucket -->|empty| Reject[REJECT 429]

Time 0:   bucket = 50 (full)
Time 0-3: 45 requests arrive → bucket = 5
Time 3-4: 10 tokens refilled → bucket = 15
Time 4:   20 requests arrive → 15 allowed, 5 rejected → bucket = 0
Time 5:   10 tokens refilled → bucket = 10

Key property: token bucket allows bursts up to B while maintaining a long-term average rate of R. This is what makes it the most practical algorithm — it handles real traffic patterns where requests arrive in clusters.

def token_bucket(client_id, rate, capacity):
    key = f"bucket:{client_id}"
    now = time.time()

    bucket = redis.hgetall(key)
    tokens = float(bucket.get("tokens", capacity))
    last_refill = float(bucket.get("last_refill", now))

    # Refill tokens based on elapsed time
    elapsed = now - last_refill
    tokens = min(capacity, tokens + elapsed * rate)

    if tokens < 1:
        return False  # reject

    tokens -= 1
    redis.hset(key, mapping={"tokens": tokens, "last_refill": now})
    redis.expire(key, int(capacity / rate) + 1)
    return True

Used by: AWS API Gateway, Stripe, GitHub API. Most real-world rate limiters use token bucket or a variant.

⚠️

Race condition in naive implementations. The read-modify-write cycle (read current tokens → compute refill → decrement → write back) is not atomic. Two concurrent requests can both read tokens=1, both decrement, and both succeed — allowing 2 requests when only 1 should pass. Use Redis Lua scripts or MULTI/EXEC to make the operation atomic.

Leaky Bucket

Requests enter a FIFO queue (the bucket). The bucket drains at a fixed rate. If the queue is full, new requests are dropped.

Requests arrive: [R1, R2, R3, R4, R5] at once
Queue capacity: 3
Drain rate: 1 request/second

t=0:  Queue: [R1, R2, R3]  R4, R5 → DROPPED
t=1:  Process R1, Queue: [R2, R3]
t=2:  Process R2, Queue: [R3]
t=3:  Process R3, Queue: []

Difference from token bucket: leaky bucket enforces a constant output rate with no bursts. Token bucket allows bursts up to capacity. Leaky bucket smooths traffic; token bucket accommodates bursty traffic.

Used by: network traffic shaping, NGINX limit_req with burst and nodelay parameters.

Comparison

Algorithm	Memory	Precision	Burst Handling	Complexity	Best For
Fixed window	O(1) per client	Low (boundary burst)	None — hard cutoff at window	Lowest	Simple API quotas
Sliding window log	O(N) per client	Exact	None	Medium	Low-volume, high-precision needs
Sliding window counter	O(1) per client	Approximate	None	Low	Production rate limiters at scale
Token bucket	O(1) per client	Exact	Allows bursts up to B	Medium	Most real-world APIs (AWS, Stripe)
Leaky bucket	O(queue size)	Exact	Smooths to constant rate	Medium	Traffic shaping, network QoS

Distributed Rate Limiting

A single-node rate limiter is straightforward — all request counters live in one process. In a distributed system with multiple API gateway instances, you need shared state.

Centralized Counter (Redis)

All gateway instances increment the same Redis key. Atomic operations (INCR, Lua scripts) ensure correctness.

flowchart LR
    G1[Gateway 1] --> R[(Redis)]
    G2[Gateway 2] --> R
    G3[Gateway 3] --> R
    R -->|count ≤ limit| Allow
    R -->|count > limit| Reject[429 Too Many Requests]

Atomic sliding window counter in Lua:

-- KEYS[1] = rate limit key
-- ARGV[1] = window (seconds), ARGV[2] = limit, ARGV[3] = now (timestamp)
local key = KEYS[1]
local window = tonumber(ARGV[1])
local limit = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

redis.call('ZREMRANGEBYSCORE', key, 0, now - window)
local count = redis.call('ZCARD', key)

if count < limit then
    redis.call('ZADD', key, now, now .. math.random())
    redis.call('EXPIRE', key, window)
    return 1  -- allowed
else
    return 0  -- rejected
end

Per-Node with Sync

Each gateway maintains a local counter and periodically synchronizes with a central store. Faster (no Redis RTT per request) but allows brief over-limit windows during sync gaps.

Approach	Latency per check	Accuracy	Redis failure mode
Centralized	+1 RTT to Redis (~1ms local, ~10ms cross-region)	Exact	Fail open (allow all) or fail closed (reject all)
Per-node + sync	Local (µs)	Approximate during sync gaps	Degrades to per-node limiting

Response Headers

Standard rate limit headers inform clients of their current state:

HTTP/1.1 200 OK
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 42
X-RateLimit-Reset: 1680307200

HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0

ℹ️

Interview tip: When designing a rate limiter, state the algorithm and why: “I’d use a token bucket with Redis — it allows burst traffic up to the bucket capacity while maintaining a long-term average rate. The bucket state is two fields (tokens, last_refill) stored in a Redis hash, checked atomically via a Lua script. For fault tolerance, I’d fail open on Redis failure — briefly allowing extra traffic is better than rejecting all requests.” This shows you’ve thought about the algorithm choice, the distributed coordination, and the failure mode.

Test Your Understanding

A fixed-window rate limiter allows 100 requests per minute. A client sends 100 requests in the last second of minute 1 and 100 in the first second of minute 2. They sent 200 requests in 2 seconds. Is this a bug?

The boundary burst problem — a known weakness of fixed-window counters. Each minute’s counter resets to 0 at the boundary. The client legitimately stays within each window’s limit but achieves 2x the intended rate by clustering requests around the boundary.

Fix: Sliding window counter — weight the current window’s count plus a fraction of the previous window’s count based on elapsed time. This smooths the boundary and limits effective burst to ~1.5x. Or use a token bucket, which naturally limits burst to the bucket capacity.

Your rate limiter uses Redis INCR + EXPIRE for a sliding window. The Redis node fails. Should you fail open (allow all traffic) or fail closed (reject all traffic)?

Fail open for most services. Rejecting all traffic during a Redis outage turns a rate-limiter failure into a full service outage — worse than the problem rate limiting solves.

Fail closed only for: (1) payment endpoints where unlimited traffic could cause financial damage, (2) endpoints that are DDoS targets where removing rate limits exposes the backend to overload.

Mitigation: Run Redis with Sentinel or Cluster for HA. Cache the last known rate limit state locally as a fallback. Or use per-node rate limiters (no coordination, less precise but no SPOF).

Token bucket vs sliding window log — which uses more memory, and which is more precise?

Sliding window log: Stores a timestamp for every request within the window. For a user making 1000 requests/minute, that’s 1000 timestamps in memory. Perfectly precise — you know the exact request count in any trailing window. Memory: O(requests per user).

Token bucket: Stores only 2 values per user: current token count and last refill timestamp. Memory: O(1) per user regardless of request rate. Less precise — it allows bursts up to the bucket capacity, which a pure sliding window wouldn’t.

In practice: Token bucket for most use cases (low memory, burst-friendly). Sliding window log only when you need exact precision and can afford O(requests) memory per user.