Rate Limiting

Rate limiting is a defensive mechanism that controls how many requests a client (user, IP address, or API key) can make to your system within a given time window. Without it, a single misbehaving client — whether an attacker or a buggy script — can consume all your server resources, crashing the service for everyone.

Think of it like a highway toll booth. Even if a million cars want to enter the highway at once, the toll booth only allows a fixed number through per minute. The rest must wait in line. Rate limiting is your system's toll booth.

1. Why Rate Limiting Matters

Threat	Without Rate Limiting	With Rate Limiting
DDoS Attacks	Millions of requests flood servers, causing downtime.	Excess requests are rejected with `429 Too Many Requests`.
Brute Force Login	Attacker tries 100,000 password combinations per second.	After 5 failed attempts, the account is locked for 15 minutes.
API Abuse	A buggy client polls your API 1,000 times per second.	Client is throttled to 100 requests per minute.
Cost Control	A single tenant on a shared cloud platform consumes all resources.	Each tenant gets a fair share of server capacity.

2. Where to Place the Rate Limiter

The rate limiter is typically placed at the edge of your system — in the API Gateway or Load Balancer — before requests reach your application servers. This ensures malicious traffic is blocked as early as possible, saving backend resources.

API Gateway (Recommended): Services like AWS API Gateway, Kong, and NGINX have built-in rate limiting. This is the standard production approach.
Application Middleware: Rate limiting logic inside your application code (e.g., Express middleware). Simpler for small projects, but doesn't scale well because each server instance needs shared state.
Client-Side (Complementary): Well-behaved clients implement their own rate limiting (e.g., exponential backoff on 429 responses). This reduces unnecessary network traffic but cannot be relied upon for security.

3. Rate Limiting Algorithms

Algorithm 1: Token Bucket

The Token Bucket is the most widely used rate-limiting algorithm in production. It is used by AWS, Stripe, and most API gateways.

How it Works:

Imagine a bucket that holds a maximum of N tokens (the bucket capacity).
Tokens are added to the bucket at a fixed rate (the refill rate), e.g., 10 tokens per second.
Each incoming request takes one token from the bucket.
If the bucket has tokens, the request is allowed.
If the bucket is empty, the request is rejected (or queued).

Why it's popular: The Token Bucket naturally handles bursts. If a user has been idle, their bucket is full, so they can make a burst of requests. Over time, the refill rate enforces the average limit.

Code

class TokenBucket {
    private tokens: number;
    private lastRefillTime: number;
 
    constructor(
        private maxTokens: number,    // Bucket capacity (e.g., 50)
        private refillRate: number     // Tokens added per second (e.g., 10)
    ) {
        this.tokens = maxTokens;
        this.lastRefillTime = Date.now();
    }
 
    allowRequest(): boolean {
        this.refill();
        if (this.tokens > 0) {
            this.tokens -= 1;
            return true;  // Request allowed
        }
        return false;     // Rate limited
    }
 
    private refill(): void {
        const now = Date.now();
        const elapsed = (now - this.lastRefillTime) / 1000; // seconds
        const tokensToAdd = elapsed * this.refillRate;
        this.tokens = Math.min(this.maxTokens, this.tokens + tokensToAdd);
        this.lastRefillTime = now;
    }
}

Algorithm 2: Fixed Window Counter

The simplest rate-limiting algorithm. It divides time into fixed-size windows (e.g., 1-minute intervals) and counts requests in each window.

How it Works:

Divide time into windows: [00:00 - 00:59], [01:00 - 01:59], etc.
Maintain a counter for each window.
When a request arrives, increment the counter for the current window.
If the counter exceeds the limit, reject the request.
When a new window starts, the counter resets to 0.

Window:  [12:00:00 - 12:00:59]  |  [12:01:00 - 12:01:59]
Limit:   100 requests per minute
Counter:    87                   |        0 (reset)

Pros:

Extremely simple. Only requires storing a single counter and a timestamp.
Very low memory: one counter per client per window.

Cons: The Boundary Problem:

The critical flaw of fixed windows is that a client can make double the allowed requests at the boundary between two windows:

Limit: 100 requests per minute

Window 1: [12:00:00 - 12:00:59]
  └── User sends 0 requests for the first 59 seconds.
  └── User sends 100 requests at 12:00:55 → All allowed

Window 2: [12:01:00 - 12:01:59]
  └── User sends 100 requests at 12:01:00 → All allowed

Result: 200 requests in a 5-second span! The "100 per minute" limit is violated.

Algorithm 3: Sliding Window Log

The Sliding Window Log fixes the boundary problem by tracking the exact timestamp of every single request.

How it Works:

Store every request's timestamp in a sorted list (e.g., a Redis Sorted Set).
When a new request arrives: a. Remove all timestamps older than now - windowSize. b. Count the remaining timestamps. c. If the count is below the limit, allow the request and add its timestamp. d. Otherwise, reject it.

Window size: 60 seconds
Limit: 5 requests

Sorted Log at 12:01:30:
  [12:00:45, 12:01:00, 12:01:10, 12:01:20, 12:01:25]
  └── 5 entries in the last 60 seconds → REJECT next request

Pros:

Perfectly accurate. No boundary problem.

Cons:

Memory-intensive: Stores every timestamp for every client. If a client makes 10,000 requests per minute, you store 10,000 timestamps per client.

Algorithm 4: Sliding Window Counter (Best of Both Worlds)

The Sliding Window Counter is a hybrid approach that combines the low memory of Fixed Windows with the accuracy of Sliding Windows.

How it Works:

Maintain counters for the current window and the previous window.
Calculate a weighted count based on how far into the current window we are.

Limit: 100 requests per minute

Previous window [12:00 - 12:01]: 84 requests
Current window  [12:01 - 12:02]: 36 requests (so far)

Current time: 12:01:15 (25% into the current window)

Estimated count = (previous * overlap%) + current
                = (84 * 0.75) + 36
                = 63 + 36
                = 99   →  Under limit → ALLOW

The overlap% represents how much of the previous window is still relevant. At 12:01:15, we are 15 seconds (25%) into the current window, meaning 75% of the previous window's traffic is still "within range."

Pros:

Low memory (only two counters per client).
Smooths out the boundary problem effectively.

Cons:

An approximation — not perfectly accurate, but close enough for production use. This is the algorithm used by Cloudflare.

4. Distributed Rate Limiting with Redis

In a system with multiple application servers behind a load balancer, you cannot store rate-limit counters in local memory — each server would have its own counter, and a client could bypass the limit by hitting different servers.

The solution is a centralized, shared counter store — typically Redis — that all servers read from and write to.

Atomic Operations with Lua Scripts

A critical concern is race conditions. If two servers read the counter simultaneously, both see count = 99 (under the limit of 100), and both increment it, the actual count becomes 101 — exceeding the limit.

To prevent this, use a Redis Lua script that reads, checks, and increments the counter atomically (as a single, indivisible operation):

Code

-- Redis Lua script for a fixed-window rate limiter
local key = KEYS[1]           -- e.g., "rate_limit:user_123:minute_42"
local limit = tonumber(ARGV[1])  -- e.g., 100
local ttl = tonumber(ARGV[2])    -- e.g., 60 (seconds)
 
local current = tonumber(redis.call("GET", key) or "0")
 
if current >= limit then
    return 0  -- REJECTED
end
 
-- Increment atomically
redis.call("INCR", key)
 
-- Set TTL only if this is the first request in the window
if current == 0 then
    redis.call("EXPIRE", key, ttl)
end
 
return 1  -- ALLOWED

Because Redis executes Lua scripts atomically (single-threaded), there is no race condition — even with hundreds of application servers calling this script concurrently.

5. Handling Rate-Limited Clients

When a client exceeds its rate limit, the server should communicate clearly using standard HTTP headers:

Response Headers

Code

HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1718489400
 
{
    "error": "rate_limit_exceeded",
    "message": "You have exceeded 100 requests per minute. Try again in 30 seconds."
}

Header	Meaning
`Retry-After`	Number of seconds the client should wait before retrying.
`X-RateLimit-Limit`	The maximum number of requests allowed in the window.
`X-RateLimit-Remaining`	How many requests the client has left in the current window.
`X-RateLimit-Reset`	The Unix timestamp when the rate limit window resets.

Client-Side Best Practice: Exponential Backoff

Well-behaved clients should respect 429 responses and implement exponential backoff — retrying with increasing delays:

Code

async function fetchWithRetry(url: string, maxRetries = 5): Promise<Response> {
    for (let attempt = 0; attempt < maxRetries; attempt++) {
        const response = await fetch(url);
 
        if (response.status !== 429) {
            return response; // Success or non-retryable error
        }
 
        // Exponential backoff: 1s, 2s, 4s, 8s, 16s (with jitter)
        const retryAfter = response.headers.get("Retry-After");
        const delay = retryAfter
            ? parseInt(retryAfter) * 1000
            : Math.pow(2, attempt) * 1000 + Math.random() * 1000;
 
        console.log(`Rate limited. Retrying in ${delay}ms...`);
        await new Promise(resolve => setTimeout(resolve, delay));
    }
 
    throw new Error("Max retries exceeded");
}

6. Rate Limiting Strategies by Scope

Different rate limits can be applied at different scopes:

Scope	Example	Use Case
Per User / API Key	1,000 requests per user per hour	Standard API rate limiting
Per IP Address	100 requests per IP per minute	Brute force protection for login pages
Per Endpoint	10 requests to `POST /api/payments` per minute	Protecting expensive operations
Global	50,000 requests per second (total)	System-wide circuit breaker

[!IMPORTANT] In production, you should apply multiple layers of rate limiting simultaneously. For example: a global limit of 50k req/s, per-user limit of 100 req/min, and a per-endpoint limit of 5 req/min on POST /api/password-reset.

7. Algorithm Comparison Summary

Algorithm	Accuracy	Memory	Burst Handling	Complexity	Used By
Token Bucket	Good	Low	Yes (natural)	Medium	AWS, Stripe, NGINX
Fixed Window	Poor (boundary)	Very Low	Poor	Simple	Basic implementations
Sliding Window Log	Perfect	High	No	Medium	High-security systems
Sliding Window Counter	Good	Low	No	Medium	Cloudflare

[!TIP] For system design interviews, Token Bucket is the safest default answer. It handles bursts gracefully, uses minimal memory, and is used by the majority of production API gateways. Mention the boundary problem of Fixed Windows to demonstrate depth.