Topic: rate limiting and quota enforcement

MCP server rate limiting and quota enforcement — per-session quotas, token bucket burst allowance, and 429 backpressure signaling

An MCP server without rate limiting is an amplifier. A single LLM session running an agentic loop can invoke the same tool thousands of times, exhausting upstream API rate limits, generating unexpected costs, and consuming server resources without bound. Unlike conventional HTTP API rate limiting, MCP rate limiting must account for the call patterns of LLM-driven tool use: bursty, session-scoped, tool-specific, and driven by reasoning that an adversarial prompt can redirect. The five patterns below enforce the right limits at the right granularity.

1. Per-session tool call quota — lifetime cap, not just throughput

Rate limiting by throughput alone (N calls per minute) does not stop a session that runs for hours from making an unlimited total number of tool calls. An agentic loop that calls a tool once every two minutes makes 720 calls in a 24-hour session: well under any per-minute rate limit, but potentially enough to exhaust an upstream API's daily quota. Per-session quotas set a hard lifetime ceiling on tool calls for a given session, regardless of how slowly those calls arrive. Track the quota per session in server memory (or in Redis for multi-process deployments), and reject further calls with a clear message once it is exhausted.

interface SessionQuota {
  totalCalls: number
  callsByTool: Record<string, number>
  sessionStartMs: number
}

const SESSION_LIMITS = {
  totalCalls:    500,    // Max calls across the entire session lifetime
  perTool:       100,    // Max calls to any single tool per session
  sessionMaxMs:  3_600_000,  // Session hard expiry: 1 hour
}

const sessionQuotas = new Map<string, SessionQuota>()

function checkSessionQuota(sessionId: string, toolName: string): void {
  const now = Date.now()
  const quota = sessionQuotas.get(sessionId) ?? {
    totalCalls: 0, callsByTool: {}, sessionStartMs: now,
  }

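  // Check session age. The entry is kept rather than deleted, so later calls
  // with the same session ID stay rejected instead of starting a fresh quota.
  if (now - quota.sessionStartMs > SESSION_LIMITS.sessionMaxMs) {
    sessionQuotas.set(sessionId, quota)
    throw new Error('Session quota: session lifetime exceeded (1 hour). Start a new session.')
  }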

  // Check total call limit
  if (quota.totalCalls >= SESSION_LIMITS.totalCalls) {
    throw new Error(`Session quota: ${SESSION_LIMITS.totalCalls} total tool calls exceeded for this session.`)
  }

  // Check per-tool limit
  const toolCalls = quota.callsByTool[toolName] ?? 0
  if (toolCalls >= SESSION_LIMITS.perTool) {
    throw new Error(`Session quota: ${SESSION_LIMITS.perTool} calls to '${toolName}' exceeded for this session.`)
  }

  // Increment counters
  quota.totalCalls++
  quota.callsByTool[toolName] = toolCalls + 1
  sessionQuotas.set(sessionId, quota)
}
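
An in-memory quota map grows for as long as the process runs unless abandoned sessions are evicted. The following is a minimal sketch of a periodic sweep, not part of the original listing: `sweepExpiredSessions`, `QuotaEntry`, and `retentionMs` are hypothetical names, and the sketch assumes the transport layer refuses requests for session IDs it has already closed, so a swept ID cannot return with a fresh quota.

```typescript
// Hypothetical helper: evicts quota entries older than retentionMs.
// Only the sessionStartMs field of each entry is needed.
interface QuotaEntry {
  sessionStartMs: number
}

function sweepExpiredSessions<T extends QuotaEntry>(
  quotas: Map<string, T>,
  retentionMs: number,
  nowMs: number = Date.now(),
): number {
  let removed = 0
  // Deleting the current entry while iterating a Map with forEach is safe
  quotas.forEach((quota, sessionId) => {
    if (nowMs - quota.sessionStartMs > retentionMs) {
      quotas.delete(sessionId)
      removed++
    }
  })
  return removed
}

// Example with a fixed clock (nowMs = 4,000,000) and a 1-hour retention
const demoQuotas = new Map<string, QuotaEntry>([
  ['old-session', { sessionStartMs: 0 }],          // age 4,000,000 ms: evicted
  ['new-session', { sessionStartMs: 3_500_000 }],  // age 500,000 ms: kept
])
const removedCount = sweepExpiredSessions(demoQuotas, 3_600_000, 4_000_000)
```

In a server this could run on a timer (for example once a minute via `setInterval`), with `retentionMs` set at or above the session lifetime.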

2. Per-tool time budget to limit expensive tool consumption

Tool calls vary enormously in cost: a simple data lookup may take 20 milliseconds, while a code execution tool might run for 30 seconds and consume GPU or CPU resources proportionally. A session that exclusively calls the expensive tool would not trigger a call-count quota quickly, but would consume vastly disproportionate resources. Per-tool time budgets track the aggregate wall-clock time spent on each tool type across a session. When the budget is exhausted, further calls to that tool are rejected. This prevents any single tool from being abused as a resource-exhaustion vector while leaving cheaper tools available.

// Time budgets per tool (milliseconds per session)
const TOOL_TIME_BUDGETS_MS: Record<string, number> = {
  'execute_code':   120_000,  // 2 minutes aggregate per session
  'run_search':      60_000,  // 1 minute aggregate
  'fetch_url':       30_000,  // 30 seconds aggregate
  // Default for unlisted tools: 60 seconds
}
const DEFAULT_BUDGET_MS = 60_000

const sessionTimeBudgets = new Map<string, Map<string, number>>()

async function withTimeBudget<T>(
  sessionId: string,
  toolName: string,
  fn: () => Promise<T>
): Promise<T> {
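  // Get or create this session's ledger and register it immediately, so
  // concurrent first calls share one Map
  const budgets = sessionTimeBudgets.get(sessionId) ?? new Map<string, number>()
  sessionTimeBudgets.set(sessionId, budgets)
  const spent = budgets.get(toolName) ?? 0
  const budget = TOOL_TIME_BUDGETS_MS[toolName] ?? DEFAULT_BUDGET_MS

  if (spent >= budget) {
    const spentSec = (spent / 1000).toFixed(0)
    const budgetSec = (budget / 1000).toFixed(0)
    throw new Error(
      `Time budget exhausted: ${toolName} has used ${spentSec}s of allowed ${budgetSec}s this session.`
    )
  }

  const start = Date.now()
  try {
    return await fn()
  } finally {
    // Re-read the running total: a concurrent call may have updated it
    // while this one was awaiting
    const elapsed = Date.now() - start
    budgets.set(toolName, (budgets.get(toolName) ?? 0) + elapsed)
  }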
}
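
The budget check in `withTimeBudget` runs only before a call starts, so a single long-running call can overshoot the budget by an arbitrary amount. One way to bound that, sketched here under assumptions, is to cap each call at the budget that remains. `remainingBudgetMs` and `withDeadline` are hypothetical helpers, and `Promise.race` only stops waiting for the result: it does not cancel the underlying work, which would need its own cancellation mechanism.

```typescript
// Hypothetical helper: how much of a tool's budget is left (never negative).
function remainingBudgetMs(spentMs: number, budgetMs: number): number {
  return Math.max(0, budgetMs - spentMs)
}

// Hypothetical helper: rejects if fn has not settled within deadlineMs.
// The work started by fn keeps running unless fn supports cancellation.
async function withDeadline<T>(
  fn: () => Promise<T>,
  deadlineMs: number,
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(
      () => reject(new Error(`Time budget exceeded after ${deadlineMs}ms`)),
      deadlineMs,
    )
  })
  try {
    return await Promise.race([fn(), timeout])
  } finally {
    // Always clear the timer so it cannot fire after fn settles
    clearTimeout(timer)
  }
}

// Example: 100 s already spent of a 120 s budget leaves a 20 s deadline
const remainingMs = remainingBudgetMs(100_000, 120_000)
const overspentMs = remainingBudgetMs(130_000, 120_000)
```

Inside `withTimeBudget`, the call `fn()` could then be wrapped as `withDeadline(fn, remainingBudgetMs(spent, budget))`.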

3. Sliding window rate limiter to prevent burst exploitation

A fixed-window rate limiter resets its counter at the start of each window (e.g., every 60 seconds). This creates an exploitable boundary: an attacker can make N calls at second 59, watch the window reset, then make N more calls at second 61 — totaling 2N calls in 2 seconds without triggering the limit. A sliding window tracks the timestamps of actual calls within a rolling time window, rather than a count within a fixed period. The window slides with each new call, so the effective rate is always measured against the most recent 60 seconds of actual history, regardless of where the window boundaries fall.

class SlidingWindowRateLimiter {
  // Map of key -> array of call timestamps (ms)
  private windows = new Map<string, number[]>()

  constructor(
    private readonly windowMs: number,  // e.g., 60_000 for 1-minute window
    private readonly maxCalls: number,  // e.g., 60 for 60 calls/minute
  ) {}

  check(key: string): { allowed: boolean; retryAfterMs: number } {
    const now = Date.now()
    const cutoff = now - this.windowMs
    const calls = (this.windows.get(key) ?? []).filter(t => t > cutoff)

    if (calls.length >= this.maxCalls) {
      // Earliest call in window determines when a slot opens up
      const oldestCall = Math.min(...calls)
      const retryAfterMs = oldestCall + this.windowMs - now
      return { allowed: false, retryAfterMs }
    }

    calls.push(now)
    this.windows.set(key, calls)
    return { allowed: true, retryAfterMs: 0 }
  }
}

// Per-session, per-tool sliding window: 30 calls per minute per tool
const rateLimiter = new SlidingWindowRateLimiter(60_000, 30)

function checkRateLimit(sessionId: string, toolName: string): void {
  const key = `${sessionId}:${toolName}`
  const { allowed, retryAfterMs } = rateLimiter.check(key)
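  if (!allowed) {
    const retrySec = Math.ceil(retryAfterMs / 1000)
    // Throw the structured RateLimitError (defined in pattern 4) so that the
    // tool handler's instanceof check recognizes it
    throw new RateLimitError({
      message: `Rate limit exceeded for ${toolName}.`,
      retryAfterSeconds: retrySec,
      limitType: 'per_minute',
    })
  }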
}
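
The boundary exploit described above can be reproduced with fixed timestamps. This is an illustrative sketch only: `fixedWindowAllowed` and `slidingWindowAllowed` are hypothetical functions that count how many calls each strategy would admit, using a limit of 5 calls per 60 seconds.

```typescript
// Fixed window: the counter resets whenever the clock crosses a boundary.
function fixedWindowAllowed(
  timestampsMs: number[], windowMs: number, maxCalls: number,
): number {
  const counts = new Map<number, number>()
  let allowed = 0
  for (const t of timestampsMs) {
    const windowIndex = Math.floor(t / windowMs)
    const used = counts.get(windowIndex) ?? 0
    if (used < maxCalls) {
      counts.set(windowIndex, used + 1)
      allowed++
    }
  }
  return allowed
}

// Sliding window: every admitted call from the last windowMs counts.
function slidingWindowAllowed(
  timestampsMs: number[], windowMs: number, maxCalls: number,
): number {
  let recent: number[] = []
  let allowed = 0
  for (const t of timestampsMs) {
    recent = recent.filter(prev => prev > t - windowMs)
    if (recent.length < maxCalls) {
      recent.push(t)
      allowed++
    }
  }
  return allowed
}

// 5 calls at second 59, then 5 more at second 61
const burst: number[] = []
for (let i = 0; i < 5; i++) burst.push(59_000)
for (let i = 0; i < 5; i++) burst.push(61_000)

const fixedCount = fixedWindowAllowed(burst, 60_000, 5)      // 10: both bursts pass
const slidingCount = slidingWindowAllowed(burst, 60_000, 5)  // 5: second burst rejected
```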

4. 429 backpressure with Retry-After for structured LLM signaling

When an MCP server rejects a tool call due to rate limiting, the error message the LLM receives shapes its next action. A generic "Error: rate limit" message with no timing information gives the LLM no reason to wait, so it may retry immediately and repeatedly, producing a retry storm that keeps the session pinned at the limit. A structured error that includes a specific retry time, modeled on HTTP 429 Too Many Requests with a Retry-After header, tells the LLM when the limit will clear and lets it either wait and retry or inform the user of the delay. This is backpressure: the server communicates its capacity constraints in a way the client can act on.

// Structured rate limit error that the LLM can parse and act on
class RateLimitError extends Error {
  readonly code = 429
  readonly retryAfterSeconds: number
  readonly limitType: 'per_minute' | 'per_session' | 'time_budget'

  constructor(opts: {
    message: string
    retryAfterSeconds: number
    limitType: RateLimitError['limitType']
  }) {
    super(opts.message)
    this.name = 'RateLimitError'
    this.retryAfterSeconds = opts.retryAfterSeconds
    this.limitType = opts.limitType
  }
}

// In the MCP tool handler: convert a RateLimitError into a structured message.
// Assumes `server` is an MCP server instance, `z` is imported from 'zod', and
// performSearch returns a valid tool result.
server.tool('search_code', { query: z.string() }, async (args, { sessionId }) => {
  try {
    checkRateLimit(sessionId, 'search_code')
    checkSessionQuota(sessionId, 'search_code')
    return await performSearch(args.query)
  } catch (err) {
    if (err instanceof RateLimitError) {
      // Re-throw with a message format the LLM can parse and act on
      throw new Error(
        `[RATE_LIMIT] ${err.message} ` +
        `Retry in ${err.retryAfterSeconds} seconds. ` +
        `Limit type: ${err.limitType}. ` +
        `Do not retry until the specified time has passed.`
      )
    }
    // Other errors (including session quota errors) pass through unchanged
    throw err
  }
})
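
Because the message format above is regular, an agent harness or a test suite can recover the fields without relying on the LLM to read them. A minimal sketch, assuming the exact `[RATE_LIMIT] ... Retry in N seconds. Limit type: X.` wording produced by the handler; `parseRateLimitMessage` and `ParsedRateLimit` are hypothetical names.

```typescript
interface ParsedRateLimit {
  retryAfterSeconds: number
  limitType: string
}

// Hypothetical helper: returns null when the text is not a [RATE_LIMIT]
// message in the format assumed above.
function parseRateLimitMessage(text: string): ParsedRateLimit | null {
  if (text.indexOf('[RATE_LIMIT]') !== 0) return null
  const retry = /Retry in (\d+) seconds\./.exec(text)
  const limit = /Limit type: (\w+)\./.exec(text)
  if (!retry || !limit) return null
  const seconds = retry[1]
  const limitType = limit[1]
  if (seconds === undefined || limitType === undefined) return null
  return { retryAfterSeconds: Number(seconds), limitType }
}

const sampleMessage =
  '[RATE_LIMIT] Rate limit exceeded for search_code. ' +
  'Retry in 12 seconds. Limit type: per_minute. ' +
  'Do not retry until the specified time has passed.'

const parsed = parseRateLimitMessage(sampleMessage)
const notRateLimit = parseRateLimitMessage('Upstream search service unavailable')
```

A harness could use `retryAfterSeconds` to pause the agent loop itself, which removes the dependence on the model choosing to wait.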

5. Token bucket algorithm for burst allowance with average rate enforcement

Strict per-second rate limits penalize legitimate tool use patterns where an LLM naturally calls several tools in quick succession at the start of a task. A token bucket accommodates this burst behavior: tokens accumulate in a bucket at a steady refill rate (e.g., 1 token/second), and each tool call consumes one token. The bucket has a maximum capacity (e.g., 10 tokens), so accumulated tokens cannot grow unbounded. This allows a burst of up to 10 calls at any moment (consuming all accumulated tokens), while the long-run average is constrained to 1 call/second by the refill rate. Legitimate agentic sessions — which burst at task start then slow down — pass cleanly. Runaway loops that call continuously do not.

class TokenBucket {
  private tokens: number
  private lastRefillMs: number

  constructor(
    private readonly capacity: number,    // Max tokens (burst size)
    private readonly refillRatePerSec: number,  // Tokens added per second
  ) {
    this.tokens = capacity  // Start full
    this.lastRefillMs = Date.now()
  }

  consume(count = 1): { allowed: boolean; retryAfterSeconds: number } {
    this.refill()

    if (this.tokens < count) {
      // Time until enough tokens refill
      const needed = count - this.tokens
      const retryAfterSeconds = needed / this.refillRatePerSec
      return { allowed: false, retryAfterSeconds }
    }

    this.tokens -= count
    return { allowed: true, retryAfterSeconds: 0 }
  }

  private refill(): void {
    const now = Date.now()
    const elapsedSec = (now - this.lastRefillMs) / 1000
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillRatePerSec)
    this.lastRefillMs = now
  }
}

// Per-session buckets: 10 token burst capacity, 1 token/second refill
const sessionBuckets = new Map<string, TokenBucket>()

function getSessionBucket(sessionId: string): TokenBucket {
  if (!sessionBuckets.has(sessionId)) {
    // 10-token burst, 1 call/second average
    sessionBuckets.set(sessionId, new TokenBucket(10, 1))
  }
  return sessionBuckets.get(sessionId)!
}

// Use in middleware applied to all tool calls
function checkTokenBucket(sessionId: string, toolName: string): void {
  const { allowed, retryAfterSeconds } = getSessionBucket(sessionId).consume()
  if (!allowed) {
    throw new RateLimitError({
      message: `Rate limit: token bucket exhausted for session.`,
      retryAfterSeconds: Math.ceil(retryAfterSeconds),
      limitType: 'per_minute',
    })
  }
}
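
To see how the burst allowance and the average rate interact, the refill arithmetic can be replayed against a simulated clock. This is a sketch for illustration: `SimulatedBucket` is a hypothetical variant of the token bucket that takes the current time as an argument instead of reading `Date.now()`, and the 250 ms call interval is an arbitrary choice for the example.

```typescript
// Hypothetical test double: same refill arithmetic as a token bucket,
// but the clock is passed in so the run is deterministic.
class SimulatedBucket {
  private tokens: number
  private lastMs = 0

  constructor(
    private readonly capacity: number,
    private readonly refillPerSec: number,
  ) {
    this.tokens = capacity  // Start full
  }

  tryConsume(nowMs: number): boolean {
    const elapsedSec = (nowMs - this.lastMs) / 1000
    this.tokens = Math.min(
      this.capacity,
      this.tokens + elapsedSec * this.refillPerSec,
    )
    this.lastMs = nowMs
    if (this.tokens < 1) return false
    this.tokens -= 1
    return true
  }
}

// A runaway loop calling every 250 ms for 60 seconds: 240 attempts
const bucket = new SimulatedBucket(10, 1)
let attempts = 0
let allowedCalls = 0
for (let t = 0; t < 60_000; t += 250) {
  attempts++
  if (bucket.tryConsume(t)) allowedCalls++
}
// allowedCalls is 69: the initial 10-token burst plus roughly 1 call/second
```

Of 240 attempts, 69 are admitted. The first 13 go through back to back while the initial tokens drain, after which only one call in four succeeds, matching the 1 token/second refill.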

How rate limiting maps to SkillAudit sub-scores

Absent rate limiting affects SkillAudit's Security and Reliability sub-scores, because runaway tool call patterns are both an abuse vector and an operational failure.

For the authorization controls that should sit alongside rate limiting, see MCP server access control. For graceful degradation when upstream rate limits are hit (rather than server-side limits), see MCP server graceful degradation security.

Audit your MCP server's rate limiting posture

SkillAudit flags missing session quotas, absent token bucket rate limiting, and undocumented rate limit behavior in MCP server tool implementations.
