Updated December 2025

Rate Limiting and Throttling Patterns for System Design

Essential algorithms and implementation strategies for controlling API traffic and preventing abuse

Key Takeaways
  1. Token bucket and sliding window algorithms are the most common rate limiting patterns, each with different trade-offs for burst handling.
  2. Rate limiting can be implemented at multiple layers: API gateway, load balancer, application, and database levels.
  3. Redis is the most popular choice for distributed rate limiting due to its atomic operations and low latency.
  4. Modern systems combine multiple algorithms (a hybrid approach) for optimal performance and fairness.
  5. Proper rate limiting prevents DDoS attacks, ensures fair resource usage, and maintains service quality under load.

  • 99% DDoS prevention
  • 60% resource savings
  • 40% response time improvement

What is Rate Limiting and Why It Matters

Rate limiting is a technique used to control the number of requests a user, IP address, or service can make to an API or system within a specific time window. It acts as a traffic control mechanism, preventing resource exhaustion and ensuring fair usage across all clients.

Unlike simple throttling (which just delays requests), rate limiting can reject excessive requests entirely. This makes it crucial for protecting against DDoS attacks, preventing abuse, and maintaining service quality under high load conditions.

Modern systems implement rate limiting at multiple layers to create defense in depth. Companies like Twitter limit API calls per user, GitHub limits Git operations per repository, and cloud providers limit API requests per account. Without proper rate limiting, a single misbehaving client can bring down an entire service.

Attack mitigation: 95% of DDoS attacks are stopped by proper rate limiting implementation.

Core Rate Limiting Algorithms Explained

There are four main rate limiting algorithms, each with different characteristics for handling burst traffic and maintaining fairness:

Token Bucket

Maintains a bucket of tokens that refill at a constant rate. Each request consumes a token. Allows burst traffic up to bucket capacity.

Key characteristics: burst handling, smooth rate limiting, memory efficient.

Sliding Window

Tracks request timestamps within a moving time window. More accurate than fixed windows but requires more memory.

Key characteristics: precise counting, memory management, timestamp tracking.

Fixed Window

Counts requests within fixed time intervals (e.g., per minute). Simple but allows burst traffic at window boundaries.

Key characteristics: simple implementation, low memory, counter-based.

Sliding Window Log

Stores individual request timestamps in a log. Most accurate but highest memory overhead.

Key characteristics: perfect accuracy, log management, storage optimization.

Token Bucket Algorithm Deep Dive

The token bucket algorithm is the most popular choice for rate limiting because it naturally handles burst traffic while maintaining long-term rate limits. Here's how it works:

python
import time
import threading

class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity        # Max tokens
        self.tokens = capacity          # Current tokens
        self.refill_rate = refill_rate  # Tokens per second
        self.last_refill = time.time()
        self.lock = threading.Lock()
    
    def allow_request(self, tokens_needed=1):
        with self.lock:
            # Refill tokens based on elapsed time
            now = time.time()
            elapsed = now - self.last_refill
            tokens_to_add = elapsed * self.refill_rate
            
            self.tokens = min(self.capacity, 
                            self.tokens + tokens_to_add)
            self.last_refill = now
            
            # Check if request can be allowed
            if self.tokens >= tokens_needed:
                self.tokens -= tokens_needed
                return True
            return False

This implementation allows burst traffic up to the bucket capacity while ensuring the long-term rate doesn't exceed the refill rate. It's memory efficient (O(1) per user) and handles edge cases like long idle periods gracefully.
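
A quick usage sketch of the class above (parameter values are illustrative): a bucket holding 10 tokens that refills at 5 tokens per second admits a burst of 10 immediately, then settles to roughly 5 requests per second.

python
# Burst of 12 requests against a 10-token bucket: the first 10
# are admitted, the last 2 are rejected until tokens refill.
bucket = TokenBucket(capacity=10, refill_rate=5)

for i in range(12):
    status = "allowed" if bucket.allow_request() else "rejected"
    print(f"request {i}: {status}")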

Rate Limiting Implementation Patterns

Rate limiting can be implemented using different architectural patterns depending on your system's requirements and constraints. Here are the most common approaches used in production systems.

Pattern comparison:

  • In-Memory: fastest and simple, but not distributed and state is lost on restart. Best for single-instance apps.
  • Redis-based: distributed, persistent, atomic operations, but adds network latency and a potential single point of failure. Best for multi-instance production systems.
  • Database-based: persistent and consistent, but slow and adds database load. Best for low-traffic applications.
  • API Gateway: centralized with no application changes, but brings vendor lock-in and limited customization. Best for microservices architectures.
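
Of the patterns above, in-memory is the simplest starting point. Here's a minimal fixed-window counter sketch (single process only; class and parameter names are illustrative):

python
# Minimal in-memory fixed-window counter. Counts requests per key
# within the current window; the count resets when a new window
# begins. Not distributed: state lives in this process only.
import time

class FixedWindowLimiter:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.counts = {}  # key -> (window_id, count)

    def allow_request(self, key):
        window_id = int(time.time() // self.window)
        current_window, count = self.counts.get(key, (window_id, 0))
        if current_window != window_id:
            count = 0  # new window started: reset the counter
        if count >= self.limit:
            return False
        self.counts[key] = (window_id, count + 1)
        return True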

Redis-Based Rate Limiting Implementation

Redis is the most popular choice for distributed rate limiting because of its atomic operations and sub-millisecond latency. Here's a production-ready sliding window implementation:

lua
-- Sliding window rate limiting script
local key = KEYS[1]                  -- Rate limit key (e.g., "user:123")
local window = tonumber(ARGV[1])     -- Time window in seconds
local limit = tonumber(ARGV[2])      -- Max requests per window
local now = tonumber(ARGV[3])        -- Current timestamp
local member = ARGV[4]               -- Unique request id; using the bare
                                     -- timestamp as the member would collapse
                                     -- concurrent requests that share it

-- Remove entries that have aged out of the window
redis.call('ZREMRANGEBYSCORE', key, 0, now - window)

-- Count current requests in window
local current = redis.call('ZCARD', key)

if current < limit then
    -- Record this request and refresh the key's TTL
    redis.call('ZADD', key, now, member)
    redis.call('EXPIRE', key, window)
    return {1, limit - current - 1}  -- [allowed, remaining]
else
    return {0, 0}  -- [denied, remaining]
end

This Lua script runs atomically on Redis, preventing race conditions in high-concurrency scenarios. The sliding window approach provides more accurate rate limiting than fixed windows, preventing the 'burst at boundary' problem.
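
Note the fourth argument: a unique request id, which prevents two requests that share a timestamp from colliding in the sorted set. A minimal caller sketch using redis-py (the variable SLIDING_WINDOW_LUA holding the script text, and the connection settings, are assumptions):

python
# Invoking the script from Python with redis-py. SLIDING_WINDOW_LUA
# is assumed to hold the Lua source above.
import time
import uuid
import redis

r = redis.Redis(host="localhost", port=6379)
sliding_window = r.register_script(SLIDING_WINDOW_LUA)

def allow_request(user_id, window=60, limit=100):
    # The script runs atomically server-side; redis-py caches it
    # and calls it by SHA after the first invocation.
    allowed, remaining = sliding_window(
        keys=[f"user:{user_id}"],
        args=[window, limit, time.time(), uuid.uuid4().hex],
    )
    return bool(allowed), remaining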

Where to Apply Rate Limiting in Your Architecture

Rate limiting can be implemented at multiple layers of your system architecture. Each layer serves different purposes and provides varying levels of protection and granularity.

  1. CDN/Edge Level: Cloudflare, AWS CloudFront - Blocks malicious traffic before it reaches your infrastructure
  2. Load Balancer: Nginx, HAProxy - Rate limit by IP address or geographic region at the entry point
  3. API Gateway: Kong, AWS API Gateway - Centralized rate limiting with authentication context
  4. Application Level: Express.js middleware, Django decorators - Business logic aware, user-specific limits
  5. Database Level: Connection pooling, query rate limiting - Protects your most critical resource

The key is implementing complementary limits at each layer. For example, you might have a global IP limit at the load balancer (1000 req/min), authenticated user limits at the API gateway (100 req/min per user), and endpoint-specific limits in your application (10 req/min for expensive operations).
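
In application code, stacked limits compose as a chain of checks that a request must pass in full. A sketch reusing the TokenBucket class from earlier (limits, key choices, and endpoint names are illustrative):

python
# Complementary limits at different granularities: global per-IP,
# per-user, and endpoint-specific. A request is admitted only if
# every applicable limiter allows it.
from collections import defaultdict

ip_buckets = defaultdict(lambda: TokenBucket(capacity=1000, refill_rate=1000 / 60))
user_buckets = defaultdict(lambda: TokenBucket(capacity=100, refill_rate=100 / 60))
search_buckets = defaultdict(lambda: TokenBucket(capacity=10, refill_rate=10 / 60))

def admit(ip, user_id, endpoint):
    checks = [ip_buckets[ip], user_buckets[user_id]]
    if endpoint == "/search":  # expensive operation, stricter limit
        checks.append(search_buckets[user_id])
    # Note: a later failure still consumes tokens from earlier
    # buckets in the chain; production systems may refund those.
    return all(bucket.allow_request() for bucket in checks)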

Distributed Rate Limiting Challenges and Solutions

In distributed systems, rate limiting becomes more complex because state must be shared across multiple application instances. The naive approach of per-instance limits doesn't work because users can bypass limits by hitting different servers.

Centralized State Approach: Use Redis or similar to maintain shared counters. This provides perfect accuracy but introduces latency and potential single points of failure.

Gossip Protocol Approach: Each instance maintains local counters and periodically shares updates with other instances. This reduces latency but sacrifices accuracy for eventually consistent rate limiting.

Hybrid Approach: Combine local and global limits. Allow some requests locally but check global state for expensive operations. This balances accuracy with performance.
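
A hedged sketch of the hybrid idea, reusing the TokenBucket class and the Redis-backed allow_request() function from the earlier examples (thresholds are illustrative):

python
# Hybrid check: every request must first pass a cheap local bucket;
# only expensive operations pay the network round-trip for an
# authoritative global check in Redis.
from collections import defaultdict

local_buckets = defaultdict(lambda: TokenBucket(capacity=20, refill_rate=10))

def allow_hybrid(user_id, expensive=False):
    # Fast path: local, per-instance state only
    if not local_buckets[user_id].allow_request():
        return False
    if expensive:
        # Slow path: shared sliding window in Redis (see above)
        allowed, _ = allow_request(user_id, window=60, limit=100)
        return allowed
    return True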

Redis latency: 2-5 ms typical round-trip time for rate limiting checks in production.

Which Should You Choose?

Choose Token Bucket when...
  • You need to handle burst traffic naturally
  • Long-term rate limiting is more important than short-term spikes
  • Memory efficiency is crucial (O(1) per user)
  • You want an industry-standard approach (used by AWS and GCP)

Choose Sliding Window when...
  • Precise rate limiting is critical
  • You need to prevent boundary condition exploits
  • Memory usage is acceptable for accuracy gains
  • Compliance or SLA requirements demand exact limits

Choose Fixed Window when...
  • Simplicity is more important than perfect accuracy
  • Memory usage must be minimal
  • Some burst traffic at boundaries is acceptable
  • You're implementing rate limiting for the first time

Choose Hybrid Approach when...
  • Different endpoints have different characteristics
  • You need both burst handling and precise limits
  • You're building enterprise-grade systems with SLAs
  • Performance and accuracy are both critical

Rate Limiting Best Practices

1. Implement Graceful Degradation

When rate limits are hit, return meaningful error messages with retry-after headers. Don't just return generic 429 errors.
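
For example, a helpful 429 in Flask might look like this (a sketch; the framework choice is illustrative and check_rate_limit() is a hypothetical helper returning the decision and a retry delay):

python
# Meaningful 429 response with a Retry-After header.
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/api/data")
def get_data():
    allowed, retry_after = check_rate_limit()  # hypothetical helper
    if not allowed:
        body = jsonify({
            "error": "Rate limit exceeded",
            "retry_after_seconds": retry_after,
        })
        return body, 429, {"Retry-After": str(retry_after)}
    return jsonify({"data": "ok"})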

2. Use Different Limits for Different Operations

Read operations can have higher limits than write operations. Expensive operations like search should have lower limits than simple data retrieval.
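
One common way to express this is a per-operation limit table (values are illustrative):

python
# Per-operation limits as data: reads are cheap, writes cost more,
# search is the most expensive.
OPERATION_LIMITS = {
    "read":   {"limit": 1000, "window": 60},  # 1000 requests/min
    "write":  {"limit": 100,  "window": 60},  # 100 requests/min
    "search": {"limit": 10,   "window": 60},  # 10 requests/min
}

def limit_for(operation):
    # Unknown operations fall back to the strictest limit
    return OPERATION_LIMITS.get(operation, OPERATION_LIMITS["search"])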

3. Implement Rate Limiting Headers

Always return X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset headers so clients can adjust behavior proactively.
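
A sketch of building those headers, assuming a fixed window aligned to window boundaries (the reset value is a Unix timestamp):

python
# Standard rate limit headers for a fixed window aligned to
# window boundaries.
import time

def rate_limit_headers(limit, remaining, window):
    reset = (int(time.time()) // window + 1) * window  # next window start
    return {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(reset),  # Unix timestamp
    }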

4. Monitor and Alert

Track rate limiting metrics: hit rates, false positives, and impact on legitimate users. Alert when limits are consistently hit.

5. Provide Rate Limit Exemptions

Have a mechanism to whitelist trusted IPs or provide higher limits for premium users. Include emergency bypass capabilities.
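
A minimal sketch of a bypass list plus tiered limits (lookup_tier() is a hypothetical helper; keys and multipliers are illustrative):

python
# Exemptions and tiered limits: trusted keys bypass the limiter
# entirely; premium users get a multiplied base limit.
TRUSTED_KEYS = {"internal-health-checker", "partner-abc"}
TIER_MULTIPLIERS = {"free": 1, "premium": 10}

def effective_limit(api_key, base_limit):
    if api_key in TRUSTED_KEYS:
        return None  # None means no limit (bypass)
    tier = lookup_tier(api_key)  # hypothetical helper
    return base_limit * TIER_MULTIPLIERS.get(tier, 1)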

6. Test Under Load

Rate limiting behavior changes under high load. Test your implementation with realistic traffic patterns and concurrent users.
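
A small concurrency smoke test for the TokenBucket class from earlier (thread and request counts are illustrative):

python
# Hammer the bucket from 20 threads and check how many requests
# were admitted; the lock inside TokenBucket keeps the count sane.
import threading

bucket = TokenBucket(capacity=100, refill_rate=10)
admitted = []

def worker():
    for _ in range(50):
        if bucket.allow_request():
            admitted.append(1)  # list.append is thread-safe in CPython

threads = [threading.Thread(target=worker) for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"admitted {len(admitted)} of 1000 requests")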

Rate Limiting in Modern Frameworks

Most modern web frameworks provide built-in rate limiting middleware or easy integration with external solutions:

javascript
// Express.js with express-rate-limit
const rateLimit = require('express-rate-limit');

const limiter = rateLimit({
  windowMs: 15 * 60 * 1000, // 15 minutes
  max: 100, // Limit each IP to 100 requests per windowMs
  message: {
    error: 'Too many requests from this IP',
    retryAfter: 900 // seconds
  },
  standardHeaders: true, // Return rate limit info in headers
  legacyHeaders: false,
});

// Apply to all requests
app.use(limiter);

// Apply to specific routes with different limits
app.use('/api/auth', rateLimit({
  windowMs: 15 * 60 * 1000,
  max: 5 // Stricter limit for auth endpoints
}));

For distributed systems, integrate with Redis for shared state across instances. This ensures consistent rate limiting regardless of which server handles the request.

Taylor Rupe

Full-Stack Developer (B.S. Computer Science, B.A. Psychology)

Taylor combines formal training in computer science with a background in human behavior to evaluate complex search, AI, and data-driven topics. His technical review ensures each article reflects current best practices in semantic search, AI systems, and web technology.