MCP server rate limiting deep dive: sliding window, token bucket, and adaptive throttling

Rate limiting is one of the most effective controls for preventing an MCP server from being weaponised by a misbehaving or compromised LLM. This guide covers three algorithms (sliding window, token bucket, and adaptive throttling) with complete implementations, memory profiles, and the tradeoffs that determine which one fits your server.

Why rate limiting is a security control, not just a reliability control

Most rate limiting discussions start and end with "prevent abuse." In the MCP context, the threat model is different. The consumer of your tool is an LLM — an entity that can be instructed via prompt injection to call your tool in a tight loop, or that can enter a retry loop of its own when a downstream service is slow.

Without a rate limit, a single compromised or misbehaving agent session can:

This is why SkillAudit's rate limiting check is scored under the Security axis, not the Maintenance axis. A server without any rate limiting starts at a C grade regardless of other findings. Three or more expensive tools with no limits is an automatic D.

Which algorithm to choose

Three algorithms cover nearly all MCP server use cases. The choice is driven by two factors: whether you need to allow short bursts, and whether you want to respond to server load rather than just request volume.

Algorithm | Burst handling | Memory per session | Latency overhead | Best for
Sliding window | No burst | O(n) per window | ~0.1ms | Strict per-session call quotas, audit-trail compliance
Token bucket | Burst allowed | O(1) per session | ~0.05ms | API-wrapping tools, tools with legitimate burst patterns
Adaptive throttling | Burst limited by load | O(1) global | ~0.2ms | Shared resource tools, tools calling expensive external APIs

If you're choosing for the first time: start with the sliding window. It's the easiest to reason about, the easiest to explain to users in error messages, and the easiest to audit. Move to token bucket when you have evidence that users are legitimately blocked by burst behaviour. Add adaptive throttling only when you have a shared resource that can become a bottleneck.

Algorithm 1: Sliding window

Most accurate

Count calls in a rolling time window. The limit resets continuously rather than at fixed boundaries, preventing burst abuse at window edges.

The fixed window counter is the naive approach: count calls in each 60-second bucket. Its flaw is that a client can make N calls at 00:59 and N more at 01:01 without ever exceeding the per-bucket limit, yet hit you with 2N calls in two seconds. The sliding window fixes this by counting the calls made in the window that ends now (the last 60 seconds, say), not the calls made since the start of the current bucket.
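The window-edge burst can be reproduced with a small simulation. This is a minimal sketch for illustration only: the helper names `countAdmittedFixed` and `countAdmittedSliding` are hypothetical, and both take an ascending array of call timestamps in milliseconds.

```javascript
// Hypothetical helpers for illustration; timestamps are in milliseconds
// and must be sorted ascending.
function countAdmittedFixed(callTimes, windowMs, maxCalls) {
  const perBucket = new Map() // bucket index → calls admitted in that bucket
  let admitted = 0
  for (const t of callTimes) {
    const bucket = Math.floor(t / windowMs)
    const used = perBucket.get(bucket) ?? 0
    if (used < maxCalls) {
      perBucket.set(bucket, used + 1)
      admitted++
    }
  }
  return admitted
}

function countAdmittedSliding(callTimes, windowMs, maxCalls) {
  const recent = [] // timestamps of admitted calls still inside the window
  let admitted = 0
  for (const t of callTimes) {
    while (recent.length > 0 && recent[0] <= t - windowMs) recent.shift()
    if (recent.length < maxCalls) {
      recent.push(t)
      admitted++
    }
  }
  return admitted
}

// 5 calls at 00:59 and 5 more at 01:01, with a limit of 5 per 60 seconds
const burst = [...Array(5).fill(59_000), ...Array(5).fill(61_000)]
const fixedAdmitted = countAdmittedFixed(burst, 60_000, 5)     // 10
const slidingAdmitted = countAdmittedSliding(burst, 60_000, 5) // 5
```

The fixed counter admits all ten calls because they fall into two different buckets; the sliding window admits only the first five.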

The implementation keeps a per-session queue of timestamps (a plain array, oldest first). When a call arrives, drop every timestamp older than the window from the front of the queue, check the count, then append the new timestamp.

class SlidingWindowLimiter {
  constructor({ windowMs = 60_000, maxCalls = 20 } = {}) {
    this.windowMs = windowMs
    this.maxCalls = maxCalls
    this.sessions = new Map() // sessionId → timestamp[]
  }

  check(sessionId) {
    const now = Date.now()
    const cutoff = now - this.windowMs

    let timestamps = this.sessions.get(sessionId)
    if (!timestamps) {
      timestamps = []
      this.sessions.set(sessionId, timestamps)
    }

    // drop calls outside the window
    while (timestamps.length > 0 && timestamps[0] <= cutoff) {
      timestamps.shift()
    }

    if (timestamps.length >= this.maxCalls) {
      const oldestInWindow = timestamps[0]
      const retryAfterMs = oldestInWindow + this.windowMs - now
      return { allowed: false, retryAfterMs }
    }

    timestamps.push(now)
    return { allowed: true, retryAfterMs: 0 }
  }

  // call in onSessionClose to prevent memory leak
  evict(sessionId) {
    this.sessions.delete(sessionId)
  }
}

// wire into your tool handler
const limiter = new SlidingWindowLimiter({ windowMs: 60_000, maxCalls: 20 })

server.tool('scan_repo', async (args, { sessionId }) => {
  const { allowed, retryAfterMs } = limiter.check(sessionId)
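  if (!allowed) {
    // simplest wiring: throw. For rate limits a structured isError result
    // is usually the better choice; see the token bucket section.
    throw new Error(
      `Rate limit exceeded. Try again in ${Math.ceil(retryAfterMs / 1000)}s.`
    )
  }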
  return runScan(args)
})

The memory cost is bounded by maxCalls timestamps per active session. If a session makes fewer calls than the limit, you're storing fewer timestamps. At 20 calls per window, each session holds at most 160 bytes of timestamp data (20 × 8 bytes, since JavaScript numbers are 64-bit floats), plus array and Map overhead. For 10,000 concurrent sessions the timestamp data comes to ~1.6MB: entirely negligible.

Memory leak pattern to avoid: The evict(sessionId) call in the onSessionClose hook is not optional. Sessions that disconnect without cleanup accumulate until your process restarts. Wire it up even if your sessions are short-lived — you won't always control when the LLM framework closes a session cleanly.
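Where the framework cannot guarantee a close event, a periodic sweep is one possible fallback alongside evict(sessionId). The sketch below is an assumption, not part of the limiter above: `sweepIdleSessions` is a hypothetical helper that expects a Map of sessionId → ascending timestamp array, the shape SlidingWindowLimiter keeps in `sessions`.

```javascript
// Hypothetical fallback: remove sessions whose newest recorded call has
// already aged out of the window. Assumes `sessions` is a Map of
// sessionId → ascending timestamp array, as in SlidingWindowLimiter.
function sweepIdleSessions(sessions, windowMs, now = Date.now()) {
  let removed = 0
  for (const [sessionId, timestamps] of sessions) {
    const newest = timestamps[timestamps.length - 1]
    // an empty array, or one whose newest call is outside the window,
    // no longer affects any rate-limit decision
    if (newest === undefined || newest <= now - windowMs) {
      sessions.delete(sessionId)
      removed++
    }
  }
  return removed
}

// illustration with fixed timestamps
const demoSessions = new Map([
  ['stale', [1_000, 2_000]],
  ['active', [50_000, 65_000]],
])
const removedCount = sweepIdleSessions(demoSessions, 60_000, 70_000)
```

In a Node.js server this could run on a timer, for example `setInterval(() => sweepIdleSessions(limiter.sessions, limiter.windowMs), 60_000).unref()`. Dropping a session this way never loosens the limit, because every timestamp it held had already expired.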

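The sliding window produces the most accurate rate limiting behaviour. It's also the most auditable: when a user asks "how many calls did I make in the last minute?", the answer is limiter.sessions.get(sessionId)?.length ?? 0, accurate as of that session's most recent check() (expired timestamps are pruned only when a call arrives). Keep that number for your own logs and audits rather than putting it in error messages. This transparency makes the sliding window a good fit for tools with audit-trail obligations under regimes such as SOC 2 or GDPR.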

Algorithm 2: Token bucket

Burst-friendly

Each session holds a bucket of tokens. Calls consume tokens; tokens refill at a fixed rate. Bursts are allowed up to the bucket capacity.

Token bucket is widely used by commercial rate-limiting services; AWS API Gateway, for example, documents its throttling as a token bucket. Its defining feature is that unused capacity accumulates up to a ceiling. A session that makes no calls for 60 seconds and then wants to run a batch of 10 calls in quick succession can do so, because at one token per 6 seconds it has 10 tokens saved up.

This matters for MCP tools that wrap batch-capable APIs. If your tool calls the GitHub API and users legitimately run "check 10 repos" commands, a sliding window set to 5 calls per minute blocks them after the first 5. A token bucket with a 10-token capacity and a refill rate of one token every 6 seconds admits the whole burst, then holds subsequent calls to a sustained 10 per minute.

class TokenBucketLimiter {
  constructor({
    capacity = 10,        // max burst size
    refillRatePerMs = 1 / 6000  // 1 token per 6 seconds
  } = {}) {
    this.capacity = capacity
    this.refillRatePerMs = refillRatePerMs
    this.buckets = new Map() // sessionId → { tokens, lastRefill }
  }

  _getBucket(sessionId) {
    let bucket = this.buckets.get(sessionId)
    if (!bucket) {
      bucket = { tokens: this.capacity, lastRefill: Date.now() }
      this.buckets.set(sessionId, bucket)
    }
    return bucket
  }

  check(sessionId) {
    const now = Date.now()
    const bucket = this._getBucket(sessionId)

    // refill tokens based on elapsed time
    const elapsed = now - bucket.lastRefill
    const refilled = elapsed * this.refillRatePerMs
    bucket.tokens = Math.min(this.capacity, bucket.tokens + refilled)
    bucket.lastRefill = now

    if (bucket.tokens < 1) {
      const tokensNeeded = 1 - bucket.tokens
      const retryAfterMs = Math.ceil(tokensNeeded / this.refillRatePerMs)
      return { allowed: false, retryAfterMs }
    }

    bucket.tokens -= 1
    return { allowed: true, retryAfterMs: 0 }
  }

  evict(sessionId) {
    this.buckets.delete(sessionId)
  }
}

// example: 10-call burst, then 1 call per 6 seconds sustained
const limiter = new TokenBucketLimiter({
  capacity: 10,
  refillRatePerMs: 1 / 6000
})

The memory cost is O(1) per session — just two numbers (token count and last refill timestamp). This is more efficient than the sliding window for high-concurrency servers. The tradeoff is that the burst allowance can be surprising to users who expect strict per-minute limits. Document your limits explicitly in the tool's description string and in error messages.
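To see what the burst allowance means in practice, the refill arithmetic can be replayed against a list of call times. This is a sketch only: `simulateTokenBucket` is a hypothetical helper that mirrors the refill logic of the class above for a single session, with timestamps passed in rather than read from Date.now().

```javascript
// Hypothetical simulation helper: replays call timestamps (ms, ascending)
// against a single token bucket and reports which calls are admitted.
function simulateTokenBucket(callTimes, capacity, refillRatePerMs) {
  let tokens = capacity // a new session starts with a full bucket
  let lastRefill = callTimes.length > 0 ? callTimes[0] : 0
  return callTimes.map((t) => {
    // refill for the time elapsed since the previous call, capped at capacity
    tokens = Math.min(capacity, tokens + (t - lastRefill) * refillRatePerMs)
    lastRefill = t
    if (tokens < 1) return false
    tokens -= 1
    return true
  })
}

// 12 calls at t=0, then one call 7 seconds later
const calls = [...Array(12).fill(0), 7_000]
const outcome = simulateTokenBucket(calls, 10, 1 / 6000)
const admittedInBurst = outcome.slice(0, 12).filter(Boolean).length // 10
const admittedAfterWait = outcome[12]                               // true
```

The first ten calls drain the bucket, the next two are rejected, and a call seven seconds later succeeds because just over one token has refilled. This is the behaviour to spell out in the tool description.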

One gotcha specific to MCP: the LLM itself can trigger retry loops. If your error message gives only a vague instruction such as "try again later," some models will retry immediately, in a tight loop. Always include an explicit retry-after value in the error, and check whether your framework can expose it as a structured field rather than free text in the error string.

// return structured error so the LLM gets clear signal
if (!allowed) {
  return {
    content: [{
      type: 'text',
      text: JSON.stringify({
        error: 'rate_limit_exceeded',
        retry_after_seconds: Math.ceil(retryAfterMs / 1000),
        message: `Tool call limit reached. Wait ${Math.ceil(retryAfterMs / 1000)}s before retrying.`
      })
    }],
    isError: true
  }
}

Why isError: true and not throw here? Rate limit errors are recoverable — the LLM should see the retry-after value and pause. A thrown exception signals an unrecoverable tool failure and may cause some frameworks to abandon the session. See our guide on when to reject vs return an error for the full decision tree.

Algorithm 3: Adaptive throttling

Load-aware

Adjust the acceptance rate based on real-time server load or downstream API latency. Automatically throttles harder when the system is stressed.

Sliding window and token bucket operate on request count alone. They don't know whether your server is currently at 10% CPU or 90% CPU, or whether the downstream API you're calling just started returning slow responses. Adaptive throttling closes that gap.

The algorithm Google describes in their SRE book uses the following formula:

// Google's client-side adaptive throttle formula:
// p(throttle) = max(0, (requests - K * accepts) / (requests + 1))
// where K is a multiplier controlling aggressiveness (2.0 is the SRE default)

class AdaptiveThrottler {
  constructor({ K = 2.0, windowMs = 10_000 } = {}) {
    this.K = K
    this.windowMs = windowMs
    this.requests = 0      // total requests in window
    this.accepts = 0       // accepted requests in window
    this.windowStart = Date.now()
  }

  _maybeReset() {
    const now = Date.now()
    if (now - this.windowStart >= this.windowMs) {
      this.requests = 0
      this.accepts = 0
      this.windowStart = now
    }
  }

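  check() {
    this._maybeReset()

    // compute the probability from the counts *before* this request;
    // otherwise the first call in every window sees requests=1, accepts=0
    // and is rejected half the time
    const throttleProb = Math.max(
      0,
      (this.requests - this.K * this.accepts) / (this.requests + 1)
    )
    this.requests++

    if (Math.random() < throttleProb) {
      return { allowed: false, throttleProb }
    }

    this.accepts++
    return { allowed: true, throttleProb }
  }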

  // call this when the downstream request succeeds
  recordSuccess() {
    // accepts is already incremented in check() — no-op here
    // extend if you want success/failure-weighted accepts
  }

  // call this when the downstream request fails or is slow
  recordBackpressure() {
    // penalise the accepts count to increase throttle probability
    this.accepts = Math.max(0, this.accepts - 1)
  }
}

The key insight is that the throttle probability starts at 0 (no throttling when the server is healthy) and rises as the ratio of accepted-to-total requests falls. When downstream latency spikes, you call recordBackpressure() to decrease the accepts count, which increases the throttle probability for subsequent requests — automatically shedding load before the downstream service times out.
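A few worked values make the formula concrete. The function below simply evaluates the expression quoted earlier; `throttleProbability` is a name chosen here for illustration.

```javascript
// Rejection probability from the formula quoted above
function throttleProbability(requests, accepts, K = 2.0) {
  return Math.max(0, (requests - K * accepts) / (requests + 1))
}

const healthy = throttleProbability(100, 100) // 0: (100 - 200) / 101 is negative
const degraded = throttleProbability(100, 40) // (100 - 80) / 101 ≈ 0.198
const failing = throttleProbability(100, 0)   // 100 / 101 ≈ 0.990
```

With K = 2, throttling only begins once accepts fall below half of requests. A smaller K starts shedding load earlier; a larger K tolerates more failures before rejecting anything.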

Adaptive throttling is a global limiter, not a per-session limiter. Wire it at the server level and short-circuit tool dispatch before the session-level rate check:

const globalThrottler = new AdaptiveThrottler({ K: 2.0, windowMs: 10_000 })

// middleware applied to every tool call
function dispatchWithThrottling(toolFn) {
  return async (args, context) => {
    const { allowed } = globalThrottler.check()
    if (!allowed) {
      return {
        content: [{ type: 'text', text: '{"error":"server_overloaded","retry_after_seconds":5}' }],
        isError: true
      }
    }

    try {
      const result = await toolFn(args, context)
      globalThrottler.recordSuccess()
      return result
    } catch (err) {
      if (err.code === 'ETIMEDOUT' || err.statusCode === 429) {
        globalThrottler.recordBackpressure()
      }
      throw err
    }
  }
}

In practice, most MCP servers don't need adaptive throttling until they're handling dozens of concurrent sessions. It adds meaningful complexity — the K parameter needs tuning for your workload, and the probabilistic nature means a session can be throttled even when the overall acceptance rate is high. Reserve it for tools that wrap expensive external APIs (LLM inference, database queries with full-table scans, file conversion services) where a sudden spike in calls would cascade into downstream failures.

Combining algorithms: layered rate limiting

Production MCP servers typically layer multiple limits. The pattern is: apply the global adaptive throttle first (shed load before it hits per-session logic), then apply a per-session token bucket (allow burst), and finally apply a per-tool sliding window for expensive tools that have their own downstream limits.

// layered rate limiting — apply in order, short-circuit on first rejection
async function dispatchTool(toolName, toolFn, args, context) {
  const { sessionId } = context

  // layer 1: global adaptive (shed load under stress)
  const global = globalThrottler.check()
  if (!global.allowed) {
    return rateError('server_overloaded', 5)
  }

  // layer 2: per-session token bucket (allow burst, smooth sustained)
  const session = sessionBucket.check(sessionId)
  if (!session.allowed) {
    return rateError('session_rate_limit', Math.ceil(session.retryAfterMs / 1000))
  }

  // layer 3: per-tool sliding window (tool-specific expensive limits)
  const perTool = toolLimiters[toolName]?.check(sessionId)
  if (perTool && !perTool.allowed) {
    return rateError(`${toolName}_rate_limit`, Math.ceil(perTool.retryAfterMs / 1000))
  }

  return toolFn(args, context)
}

function rateError(code, retryAfterSeconds) {
  return {
    content: [{
      type: 'text',
      text: JSON.stringify({ error: code, retry_after_seconds: retryAfterSeconds })
    }],
    isError: true
  }
}

What SkillAudit grades

Our rate limiting audit check inspects tool handlers for rate-limiting patterns. The grading works like this:

Finding | Grade impact | Why
No rate limiting on any tool | −20 pts | Billing attack and abuse surface with no mitigation
Rate limiting present but no eviction on session close | −10 pts | Memory leak; limiter degrades over time
Fixed window counter (not sliding window) | −5 pts | Window-edge burst bypass
Rate limit error reveals internal state (e.g., "you called 18/20 times") | −5 pts | Enumeration oracle; tells attacker exactly how much headroom remains
Sliding window or token bucket with session eviction | +5 pts | Baseline protection; no memory leak
Layered limits (per-session + per-tool) | +5 pts | Defence in depth; individual expensive tools protected

One finding that surprises authors: telling users their exact current call count is a minor security finding. The safer pattern is to tell them how long to wait, not how many calls they've made. "Rate limit exceeded. Retry in 45s." is better than "18/20 calls used. Limit resets in 45s." — both inform the user appropriately, but the former doesn't let an adversary enumerate your headroom to time calls just under the limit.

Common mistakes

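Global rate limit instead of per-session. A single global counter used as your only limit lets one active session crowd out all others. Always key the primary limit per session, even if the session concept in your framework is implicit. A global adaptive throttle belongs on top of per-session limits, not in place of them.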

Rate limiting the wrong unit. Counting calls is obvious. But some tools should be limited by compute cost rather than call count: a tool that processes a 1MB file can cost roughly 1000× more than one that processes 1KB. For cost-proportional limiting, track processing time (the elapsed milliseconds of the handler) instead of or in addition to call count.
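One way to limit by cost is a bucket whose units are milliseconds of handler time rather than calls. The class below is a sketch under stated assumptions: `CostBudgetLimiter`, its option names, and the default numbers are all hypothetical, and the `now` option exists only so the clock can be replaced in tests.

```javascript
// Hypothetical cost-weighted bucket: each session has a budget of cost units
// (here, milliseconds of handler time) that refills continuously.
class CostBudgetLimiter {
  constructor({ capacity = 5_000, refillPerMs = 0.05, now = Date.now } = {}) {
    this.capacity = capacity       // max handler-ms a session can bank
    this.refillPerMs = refillPerMs // handler-ms regained per wall-clock ms
    this.now = now                 // injectable clock, for tests
    this.budgets = new Map()       // sessionId → { remaining, lastRefill }
  }

  _refill(sessionId) {
    const t = this.now()
    let b = this.budgets.get(sessionId)
    if (!b) {
      b = { remaining: this.capacity, lastRefill: t }
      this.budgets.set(sessionId, b)
    }
    const regained = (t - b.lastRefill) * this.refillPerMs
    b.remaining = Math.min(this.capacity, b.remaining + regained)
    b.lastRefill = t
    return b
  }

  // admit a call only while the session still has budget left
  check(sessionId) {
    return { allowed: this._refill(sessionId).remaining > 0 }
  }

  // charge the measured cost after the handler finishes; the balance may go
  // negative, which delays the next admitted call until the debt is repaid
  charge(sessionId, costMs) {
    this._refill(sessionId).remaining -= costMs
  }

  evict(sessionId) {
    this.budgets.delete(sessionId)
  }
}
```

In a handler, record `Date.now()` before the work starts and call `charge(sessionId, Date.now() - start)` in a `finally` block, so that failed calls are charged too. Because the cost is only known afterwards, one expensive call can overshoot the budget; the negative balance is what makes the session wait longer before its next call.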

Not testing the retry path. Write a test that calls your tool 21 times in a loop and verify that: (1) calls 1–20 succeed, (2) call 21 returns a structured error with a retry_after_seconds field, and (3) after waiting that many seconds, call 22 succeeds. Most rate limiting bugs are in the retry path, not the counting path.
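A retry-path test is easier to write when the clock can be controlled. The sketch below assumes a limiter that accepts an injectable `now` function; that option is an assumption added here for testability (the SlidingWindowLimiter class above reads Date.now() directly), and `makeLimiter` and `runRetryPathTest` are hypothetical names.

```javascript
// Compact sliding window with an injectable clock (the `now` option is an
// assumption added for testability; it is not part of the class above).
function makeLimiter({ windowMs, maxCalls, now }) {
  const sessions = new Map()
  return {
    check(sessionId) {
      const t = now()
      const stamps = sessions.get(sessionId) ?? []
      sessions.set(sessionId, stamps)
      while (stamps.length > 0 && stamps[0] <= t - windowMs) stamps.shift()
      if (stamps.length >= maxCalls) {
        return { allowed: false, retryAfterMs: stamps[0] + windowMs - t }
      }
      stamps.push(t)
      return { allowed: true, retryAfterMs: 0 }
    },
  }
}

function runRetryPathTest() {
  let clock = 1_000_000 // fake time in ms; advanced manually
  const limiter = makeLimiter({ windowMs: 60_000, maxCalls: 20, now: () => clock })
  const results = []

  // calls 1 to 21, one second apart
  for (let i = 0; i < 21; i++) {
    clock += 1_000
    results.push(limiter.check('s1'))
  }
  const rejected = results[20]

  // wait as long as the limiter asked (rounded up to whole seconds, as the
  // error message would report it), then make call 22
  clock += Math.ceil(rejected.retryAfterMs / 1000) * 1000
  const afterWait = limiter.check('s1')

  return {
    firstTwentyAllowed: results.slice(0, 20).every((r) => r.allowed),
    twentyFirstRejected: !rejected.allowed && rejected.retryAfterMs > 0,
    retrySucceeded: afterWait.allowed,
  }
}

const retryPathReport = runRetryPathTest()
```

All three fields of `retryPathReport` should be true. The same structure carries over to a real test runner; the important part is that call 22 is made after exactly the advertised wait, which is where off-by-one errors in the window boundary show up.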

Forgetting that the LLM is the caller. Human users read your error message. The LLM processes it as structured input and may loop. If your error message says "try again soon," the model may interpret "soon" as "immediately." Be explicit with numbers. Consider returning a should_retry: false field in rate limit errors where you specifically want the LLM to surface the error to the user rather than retry autonomously.
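A small helper can keep these rules in one place. This is a sketch: `rateLimitResult` is a hypothetical name, the field names follow the structured error shown earlier, and `should_retry` is the optional field suggested above rather than anything defined by MCP.

```javascript
// Hypothetical helper: builds a rate-limit result with explicit numbers and a
// should_retry hint for the model.
function rateLimitResult(retryAfterMs, { shouldRetry = true } = {}) {
  // never report 0s, which a model could read as "retry immediately"
  const retryAfterSeconds = Math.max(1, Math.ceil(retryAfterMs / 1000))
  return {
    content: [{
      type: 'text',
      text: JSON.stringify({
        error: 'rate_limit_exceeded',
        retry_after_seconds: retryAfterSeconds,
        should_retry: shouldRetry,
        message: shouldRetry
          ? `Tool call limit reached. Wait ${retryAfterSeconds}s before retrying.`
          : 'Tool call limit reached. Do not retry; report this to the user.',
      }),
    }],
    isError: true,
  }
}

const example = rateLimitResult(44_200, { shouldRetry: false })
const payload = JSON.parse(example.content[0].text)
// payload.retry_after_seconds is 45 and payload.should_retry is false
```

The message states a number in the retry case and an unambiguous instruction in the no-retry case, and it never discloses how many calls the session has used.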

Putting it together

For a new MCP server with no existing rate limiting:

  1. Add a sliding window limiter (20 calls per minute per session) to every tool handler.
  2. Wire evict(sessionId) into your session close hook.
  3. Return structured isError: true responses for rate limit hits with a retry_after_seconds field.
  4. Run a quick audit — SkillAudit's free tier will flag any patterns the static pass catches.

That's the baseline. Once you're shipping, instrument call rates per session and per tool. If you see legitimate users consistently hitting the limit, switch to token bucket to allow burst. If you see downstream API latency spikes correlating with call volume, layer in adaptive throttling on top.

Rate limiting is one of those controls that feels like yak shaving until the day a prompt injection attack sends your tool into a loop and burns through a month of API credits in an afternoon. The implementations above are each under 50 lines. Add them before you ship.