Topic: rate limiting security

MCP server rate limiting security — LLM-driven tool call amplification, per-session limits, cost-based rate limiting, denial of capability attacks

Traditional API rate limits assume a human-paced client. MCP servers instead face LLM-driven tool call amplification: an LLM in an agentic loop can make hundreds of tool calls per minute with no human in the loop. A prompt injection that drives such a loop can amplify a single malicious instruction into a sustained denial-of-capability attack or the bulk exhaustion of API quota. This article covers five rate limiting patterns for MCP servers: per-session call budgets, cost-weighted throttling, agentic loop detection, graceful degradation, and cross-session organization limits.

Quick reference

LLMs may ignore HTTP 429, but they tend to honor tool result errors: When an MCP server returns an HTTP 429, the agent framework may surface it as a failed call that the LLM reissues immediately. When the server returns a structured tool result with a rate_limited status and a retry_after_seconds value, a well-behaved LLM framework can pass that guidance to the model. Always rate-limit at the tool result layer, not the HTTP layer.
Not all tool calls cost the same: A read_file call that reads 100 bytes costs far less to your downstream API quota than a search_codebase that fans out to 50 sub-requests. Define a cost unit per tool and limit on cost units, not call count.
Rapid same-argument repetition signals a loop: An LLM that calls list_repos with the same arguments five times in ten seconds is in a loop. A prompt injection driving that loop will not stop unless the server fires a circuit breaker.
Per-session limits protect the session; org limits protect the server: Both are needed. A per-session limit stops a single runaway session from exhausting its own budget and the downstream quota behind it. An org-level limit prevents a single organization from consuming all server capacity.
Include remaining budget in every tool response: An LLM that can see its remaining call budget can self-regulate. An LLM that cannot see it will keep calling until blocked.

1. Per-session call budgets

The token bucket algorithm is well suited to MCP per-session rate limiting: each session receives a bucket with a configurable capacity, and each tool call drains one token. When the bucket is empty, further calls return a structured error that tells the LLM to wait before retrying. The bucket refills at a configurable rate: per minute for most tools, per hour for particularly sensitive or expensive operations. The code in this section uses a simplified variant that refills each bucket in full when its window expires, rather than adding tokens continuously.

The bucket capacity and refill rate should vary by tool sensitivity. A file-read tool that exposes no sensitive data may have a generous budget of 120 calls per minute. A bulk API call tool that fans out to a downstream service may have a much tighter budget of 6 calls per minute to protect the downstream service's quota. The budget policy is declarative (defined in a configuration object per tool) and enforced by a shared middleware layer, so individual tool handlers do not need to implement their own rate limiting.

interface RateLimitPolicy {
  toolName: string
  costUnits: number
  sessionBudgetPerMinute: number
  sessionBudgetPerHour: number
}

interface TokenBucket {
  minuteTokens: number
  hourTokens: number
  minuteResetAt: number
  hourResetAt: number
}

const RATE_LIMIT_POLICIES: Record<string, RateLimitPolicy> = {
  'read_file':      { toolName: 'read_file',      costUnits: 1,  sessionBudgetPerMinute: 120, sessionBudgetPerHour: 1000 },
  'search_codebase':{ toolName: 'search_codebase',costUnits: 5,  sessionBudgetPerMinute: 20,  sessionBudgetPerHour: 200  },
  'bulk_api_call':  { toolName: 'bulk_api_call',  costUnits: 10, sessionBudgetPerMinute: 6,   sessionBudgetPerHour: 60   },
  'create_webhook': { toolName: 'create_webhook', costUnits: 20, sessionBudgetPerMinute: 3,   sessionBudgetPerHour: 10   }
}

const sessionBuckets = new Map<string, Map<string, TokenBucket>>()

function getOrCreateBucket(sessionId: string, toolName: string, policy: RateLimitPolicy): TokenBucket {
  if (!sessionBuckets.has(sessionId)) sessionBuckets.set(sessionId, new Map())
  const toolBuckets = sessionBuckets.get(sessionId)!
  if (!toolBuckets.has(toolName)) {
    toolBuckets.set(toolName, {
      minuteTokens: policy.sessionBudgetPerMinute,
      hourTokens: policy.sessionBudgetPerHour,
      minuteResetAt: Date.now() + 60_000,
      hourResetAt: Date.now() + 3_600_000
    })
  }
  return toolBuckets.get(toolName)!
}

// Checks and, if allowed, charges the per-session budget for one call to toolName.
function checkRateLimit(sessionId: string, toolName: string): {
  allowed: boolean; retryAfterMs?: number; remainingMinute: number; remainingHour: number
} {
  const policy = RATE_LIMIT_POLICIES[toolName]
  // Tools without a policy are not limited here; consider a default policy so new tools fail closed.
  if (!policy) return { allowed: true, remainingMinute: 999, remainingHour: 9999 }

  const bucket = getOrCreateBucket(sessionId, toolName, policy)
  const now = Date.now()

  // Refill each bucket in full once its window has elapsed.
  if (now > bucket.minuteResetAt) { bucket.minuteTokens = policy.sessionBudgetPerMinute; bucket.minuteResetAt = now + 60_000 }
  if (now > bucket.hourResetAt)   { bucket.hourTokens = policy.sessionBudgetPerHour;     bucket.hourResetAt = now + 3_600_000 }

  // Budgets are expressed in calls, so each call drains one token.
  // Comparing costUnits against a call budget would block bulk_api_call (budget 6, cost 10)
  // and create_webhook (budget 3, cost 20) permanently; costUnits belongs to cost-weighted throttling.
  const tokensPerCall = 1

  if (bucket.minuteTokens < tokensPerCall) {
    return { allowed: false, retryAfterMs: bucket.minuteResetAt - now, remainingMinute: bucket.minuteTokens, remainingHour: bucket.hourTokens }
  }
  if (bucket.hourTokens < tokensPerCall) {
    return { allowed: false, retryAfterMs: bucket.hourResetAt - now, remainingMinute: bucket.minuteTokens, remainingHour: bucket.hourTokens }
  }

  bucket.minuteTokens -= tokensPerCall
  bucket.hourTokens -= tokensPerCall
  return { allowed: true, remainingMinute: bucket.minuteTokens, remainingHour: bucket.hourTokens }
}
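
The code above resets each budget in full when its window expires, which behaves more like a fixed-window counter than a classic token bucket. For comparison, here is a minimal sketch of a continuously refilling bucket. The names RefillingBucket and tryConsume are illustrative assumptions, not part of any MCP SDK, and the clock is passed in so the behavior can be checked deterministically.

```typescript
// Sketch only: RefillingBucket is a hypothetical name used for illustration.
class RefillingBucket {
  private tokens: number
  private lastRefillMs: number
  private readonly capacity: number        // maximum burst size
  private readonly refillPerSecond: number // sustained rate

  constructor(capacity: number, refillPerSecond: number, nowMs: number = Date.now()) {
    this.capacity = capacity
    this.refillPerSecond = refillPerSecond
    this.tokens = capacity
    this.lastRefillMs = nowMs
  }

  // Adds the tokens earned since the last refill, capped at capacity.
  private refill(nowMs: number): void {
    const elapsedSeconds = Math.max(0, nowMs - this.lastRefillMs) / 1000
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond)
    this.lastRefillMs = nowMs
  }

  // Tries to take `cost` tokens (assumes cost <= capacity).
  // On refusal, retryAfterMs is the time needed to earn the missing tokens.
  tryConsume(cost: number = 1, nowMs: number = Date.now()): { allowed: boolean; retryAfterMs: number; remaining: number } {
    this.refill(nowMs)
    if (this.tokens >= cost) {
      this.tokens -= cost
      return { allowed: true, retryAfterMs: 0, remaining: this.tokens }
    }
    const deficit = cost - this.tokens
    const retryAfterMs = Math.ceil((deficit / this.refillPerSecond) * 1000)
    return { allowed: false, retryAfterMs, remaining: this.tokens }
  }
}
```

With continuous refill, a session that has emptied its bucket regains capacity gradually instead of receiving a full burst allowance at each window boundary, and the retry hint reflects the actual wait for the requested cost.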

2. Cost-weighted throttling

Limiting on call count treats a read_file call identically to a bulk_api_call that fans out to 50 downstream requests. An LLM that makes 10 bulk_api_call tool calls in a burst window consumes the same downstream quota as 500 individual calls while spending only 10 rate limit tokens. Cost-weighted throttling assigns each tool a cost unit value that reflects its actual impact, then limits on cost units rather than call count.

// Cost unit definitions — reflect actual downstream API impact
const COST_UNITS: Record<string, { value: number; label: string }> = {
  'get_file_content':  { value: 1,  label: 'file_read' },
  'get_commit_message':{ value: 1,  label: 'metadata_read' },
  'search_codebase':   { value: 5,  label: 'search_fanout' },
  'find_references':   { value: 8,  label: 'index_scan' },
  'create_issue':      { value: 15, label: 'write_op' },
  'push_commit':       { value: 25, label: 'write_op_critical' },
  'bulk_label_issues': { value: 40, label: 'bulk_write' },
  'export_repository': { value: 50, label: 'bulk_read_large' }
}

class CostWeightedThrottle {
  private readonly SESSION_BUDGET_PER_MIN = 100
  private sessionUsage = new Map<string, { units: number; resetAt: number }>()

  check(sessionId: string, toolName: string): {
    allowed: boolean; costUnits: number; remainingBudget: number; retryAfterSeconds?: number
  } {
    const cost = COST_UNITS[toolName] ?? { value: 1, label: 'unknown' }
    const now = Date.now()

    let usage = this.sessionUsage.get(sessionId)
    if (!usage || now > usage.resetAt) {
      usage = { units: 0, resetAt: now + 60_000 }
      this.sessionUsage.set(sessionId, usage)
    }

    const remainingBudget = this.SESSION_BUDGET_PER_MIN - usage.units

    if (cost.value > remainingBudget) {
      return {
        allowed: false,
        costUnits: cost.value,
        remainingBudget,
        retryAfterSeconds: Math.ceil((usage.resetAt - now) / 1000)
      }
    }

    usage.units += cost.value
    return { allowed: true, costUnits: cost.value, remainingBudget: remainingBudget - cost.value }
  }
}
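
The difference between the two limiting strategies can be checked with a little arithmetic. The sketch below uses assumed numbers (a fan-out of 50, a cost of 50 units, a budget of 10 calls or 100 units per window); ToolProfile and both helper functions are hypothetical names introduced only for this example.

```typescript
// Illustrative numbers only: fan-out, cost, and budgets are assumptions for the example.
interface ToolProfile { fanOut: number; costUnits: number }

// Worst-case downstream requests per window when the limiter counts calls.
function worstCaseByCallCount(tool: ToolProfile, callsPerWindow: number): number {
  return callsPerWindow * tool.fanOut
}

// Worst-case downstream requests per window when the limiter counts cost units.
function worstCaseByCostUnits(tool: ToolProfile, unitsPerWindow: number): number {
  const allowedCalls = Math.floor(unitsPerWindow / tool.costUnits)
  return allowedCalls * tool.fanOut
}

// A bulk tool whose cost is calibrated to its fan-out.
const bulkCall: ToolProfile = { fanOut: 50, costUnits: 50 }

const byCount = worstCaseByCallCount(bulkCall, 10)  // 10 calls x 50 requests = 500
const byCost = worstCaseByCostUnits(bulkCall, 100)  // floor(100 / 50) = 2 calls x 50 requests = 100
```

When a tool's cost units track its fan-out, the worst-case downstream load stays at or below the unit budget regardless of which tools the session calls. A cost set lower than the fan-out loosens that bound proportionally.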

3. Agentic loop detection

A prompt injection that drives an LLM into a tool-call loop is qualitatively different from a high-volume legitimate session. The distinguishing pattern is repeated identical calls: the same tool, the same arguments, in rapid succession. A legitimate LLM session may call read_file many times, but with different paths as it explores a codebase. A loop driven by prompt injection calls it with the same path repeatedly, because each tool result carries the injected instruction back into the model's context.

Loop detection fires a circuit breaker when the same tool-argument combination is observed N times within T seconds. The circuit breaker returns a structured error, logs the pattern for security review, and holds the session in a cooldown period before allowing further calls.

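import { createHash } from 'crypto'

// auditLog is assumed to be the server's structured logger; a console-backed stand-in is shown here.
const auditLog = { warn: (event: string, details: Record<string, unknown>) => console.warn(event, details) }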

class AgenticLoopDetector {
  private readonly LOOP_WINDOW_MS = 10_000   // 10 seconds
  private readonly LOOP_THRESHOLD = 4         // 4 identical calls = loop
  private readonly COOLDOWN_MS = 60_000       // 60 second cooldown

  private callHistory = new Map<string, { toolName: string; timestamps: number[] }>()
  private cooldowns = new Map<string, number>()

  private hashArgs(args: unknown): string {
    return createHash('sha256').update(JSON.stringify(args) ?? '').digest('hex').slice(0, 16)
  }

  check(sessionId: string, toolName: string, args: unknown): {
    isLoop: boolean; loopPattern?: { tool: string; callCount: number }; inCooldown: boolean; cooldownRemainingMs?: number
  } {
    const now = Date.now()
    const cooldownExpiry = this.cooldowns.get(sessionId)
    if (cooldownExpiry && now < cooldownExpiry) {
      return { isLoop: true, inCooldown: true, cooldownRemainingMs: cooldownExpiry - now }
    }

    const argsHash = this.hashArgs(args)
    const key = `${sessionId}:${toolName}:${argsHash}`
    let signal = this.callHistory.get(key)
    if (!signal) { signal = { toolName, timestamps: [] }; this.callHistory.set(key, signal) }

    signal.timestamps = signal.timestamps.filter(t => now - t < this.LOOP_WINDOW_MS)
    signal.timestamps.push(now)

    if (signal.timestamps.length >= this.LOOP_THRESHOLD) {
      this.cooldowns.set(sessionId, now + this.COOLDOWN_MS)
      auditLog.warn('agentic_loop_detected', {
        sessionId, toolName, argsHash,
        callCount: signal.timestamps.length,
        severity: 'HIGH',
        note: 'May indicate prompt injection driving a tool-call loop'
      })
      return {
        isLoop: true,
        loopPattern: { tool: toolName, callCount: signal.timestamps.length },
        inCooldown: true,
        cooldownRemainingMs: this.COOLDOWN_MS
      }
    }

    return { isLoop: false, inCooldown: false }
  }
}
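
One caveat with hashArgs above: JSON.stringify is sensitive to property order, so the same arguments sent with keys in a different order would produce a different hash and could slip past the detector. A possible mitigation is to canonicalize the arguments before hashing. The sketch below assumes the arguments are plain JSON values; canonicalJson is a hypothetical helper, not part of the detector or of any library.

```typescript
// Sketch: canonicalJson is a hypothetical helper for illustration.
// It serializes plain JSON values with object keys sorted, so that
// { a: 1, b: 2 } and { b: 2, a: 1 } produce the same string.
function canonicalJson(value: unknown): string {
  if (value === null || typeof value !== 'object') {
    // Primitives. undefined and functions have no JSON form, so they map to 'null'.
    return JSON.stringify(value) ?? 'null'
  }
  if (Array.isArray(value)) {
    // Array order is meaningful and is preserved.
    return '[' + value.map(item => canonicalJson(item)).join(',') + ']'
  }
  const record = value as Record<string, unknown>
  const entries = Object.keys(record)
    .sort()
    .filter(key => record[key] !== undefined) // JSON.stringify also drops undefined properties
    .map(key => JSON.stringify(key) + ':' + canonicalJson(record[key]))
  return '{' + entries.join(',') + '}'
}
```

Passing canonicalJson(args) to the hash instead of JSON.stringify(args) would make the loop key independent of key order while still distinguishing arrays whose elements are reordered.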

4. Graceful degradation

Standard HTTP rate limiting returns a 429 status code. An HTTP client that receives a 429 typically backs off exponentially at the HTTP level. In an agentic tool-call loop, however, the failed call may instead be surfaced to the LLM, which can reissue the tool call immediately as part of its next reasoning step. The result is a rapid retry flood rather than a well-behaved backoff.

MCP rate limiting must operate at the tool result layer, not the HTTP layer. When a rate limit is hit, the server returns a successful HTTP 200 with a tool result whose content signals the rate limit condition in structured form. An LLM that receives a tool result with explicit retry guidance is more likely to follow it than one that only sees an exception.

// Return structured tool result, not HTTP 429
function buildRateLimitedToolResult(params: {
  toolName: string; retryAfterSeconds: number; remainingBudget: number; alternativeTools?: string[]
}): object {
  const { toolName, retryAfterSeconds, remainingBudget, alternativeTools = [] } = params

  return {
    content: [
      {
        type: 'text',
        text: `Rate limited: ${toolName} is temporarily unavailable. Retry after ${retryAfterSeconds}s.`
      },
      {
        type: 'resource',
        resource: {
          uri: 'mcp://rate-limit-info',
          mimeType: 'application/json',
          text: JSON.stringify({
            status: 'rate_limited',
            limited_tool: toolName,
            retry_after_seconds: retryAfterSeconds,
            remaining_budget_units: remainingBudget,
            available_tools: alternativeTools,
            guidance: alternativeTools.length > 0
              ? `While waiting, you may use: ${alternativeTools.join(', ')}`
              : 'Please pause and retry after the wait period.'
          })
        }
      }
    ],
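    // isError: false is a deliberate choice: this is a handled condition, not an unhandled exception.
    // An isError: true result may cause the LLM framework to retry immediately.
    // Note that MCP also allows tool execution errors to be reported with isError: true;
    // test which form your target clients handle better.
    isError: false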
  }
}

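// Middleware: wrap tool handlers with rate limiting.
// CallToolRequest is taken from the MCP TypeScript SDK types. RequestContext and ToolHandler
// are assumed application types, sketched here so the snippet type-checks.
import type { CallToolRequest } from '@modelcontextprotocol/sdk/types.js'

interface RequestContext { sessionId: string }
type ToolHandler = (request: CallToolRequest, context: RequestContext) => Promise<object>
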
function withRateLimit(
  handler: ToolHandler,
  throttle: CostWeightedThrottle,
  loopDetector: AgenticLoopDetector
) {
  return async (request: CallToolRequest, context: RequestContext): Promise<object> => {
    const loopCheck = loopDetector.check(context.sessionId, request.params.name, request.params.arguments)
    if (loopCheck.isLoop) {
      return buildRateLimitedToolResult({
        toolName: request.params.name,
        retryAfterSeconds: Math.ceil((loopCheck.cooldownRemainingMs ?? 60_000) / 1000),
        remainingBudget: 0
      })
    }

    const rateLimitCheck = throttle.check(context.sessionId, request.params.name)
    if (!rateLimitCheck.allowed) {
      return buildRateLimitedToolResult({
        toolName: request.params.name,
        retryAfterSeconds: rateLimitCheck.retryAfterSeconds ?? 60,
        remainingBudget: rateLimitCheck.remainingBudget
      })
    }

    const result = await handler(request, context)
    // Augment successful result with remaining budget for LLM self-regulation
    return Object.assign({}, result, {
      _meta: { rate_limit: { remaining_budget_units: rateLimitCheck.remainingBudget } }
    })
  }
}
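
On the consuming side, an agent host can read the structured payload and pause before the next call instead of relying on the model to interpret the text. The following is a rough sketch under the assumption that the result has the shape produced by buildRateLimitedToolResult above; extractRetryAfterSeconds and the local ContentItem and ToolResult types are hypothetical, not MCP SDK exports.

```typescript
// Sketch of the consumer side. Types are local stand-ins that mirror the result shape above.
interface ContentItem {
  type: string
  text?: string
  resource?: { uri: string; mimeType?: string; text?: string }
}
interface ToolResult { content: ContentItem[]; isError?: boolean }

// Returns the wait in seconds if the result signals a rate limit, otherwise null.
function extractRetryAfterSeconds(result: ToolResult): number | null {
  for (const item of result.content) {
    const resource = item.resource
    if (item.type !== 'resource' || !resource || resource.uri !== 'mcp://rate-limit-info') continue
    try {
      const info = JSON.parse(resource.text ?? '{}') as { status?: unknown; retry_after_seconds?: unknown }
      if (info.status === 'rate_limited' && typeof info.retry_after_seconds === 'number') {
        return info.retry_after_seconds
      }
    } catch {
      // Malformed payload: treat it as carrying no rate limit information.
    }
  }
  return null
}
```

A host that receives a non-null value could delay the next call to the limited tool by that many seconds, which enforces the backoff even when the model itself does not.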

5. Cross-session organization limits

Per-session limits prevent a single session from consuming excessive resources. But without a cross-session layer, an attacker who controls many sessions can distribute tool calls across them to stay under per-session limits while consuming all available server capacity. Cross-session organization limits impose an aggregate cap across all sessions belonging to a single user or organization. The org-level limiter runs in addition to per-session limits — both must pass for a tool call to proceed.

interface OrgUsageRecord {
  orgId: string
  unitsThisMinute: number
  unitsThisHour: number
  activeSessionCount: number
  minuteResetAt: number
  hourResetAt: number
}

class OrganizationRateLimiter {
  private readonly ORG_BUDGET_PER_MINUTE = 500
  private readonly ORG_BUDGET_PER_HOUR = 5000
  private readonly ORG_MAX_SESSIONS = 20

  private orgUsage = new Map<string, OrgUsageRecord>()
  private sessionOrgMap = new Map<string, string>()

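  registerSession(sessionId: string, orgId: string): void {
    // Ignore repeat registrations so activeSessionCount is not inflated.
    if (this.sessionOrgMap.has(sessionId)) return
    this.sessionOrgMap.set(sessionId, orgId)
    this.getOrCreateRecord(orgId).activeSessionCount++
  }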

  unregisterSession(sessionId: string): void {
    const orgId = this.sessionOrgMap.get(sessionId)
    if (orgId) {
      const record = this.orgUsage.get(orgId)
      if (record) record.activeSessionCount = Math.max(0, record.activeSessionCount - 1)
      this.sessionOrgMap.delete(sessionId)
    }
  }

  check(sessionId: string, costUnits: number): {
    allowed: boolean; reason?: 'org_minute_budget' | 'org_hour_budget' | 'org_session_limit'; retryAfterSeconds?: number
  } {
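    const orgId = this.sessionOrgMap.get(sessionId)
    // Sessions that were never registered bypass the org limits here.
    // Register every session at connect time, or change this branch to fail closed.
    if (!orgId) return { allowed: true }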

    const now = Date.now()
    const record = this.getOrCreateRecord(orgId)

    if (now > record.minuteResetAt) { record.unitsThisMinute = 0; record.minuteResetAt = now + 60_000 }
    if (now > record.hourResetAt)   { record.unitsThisHour = 0;   record.hourResetAt   = now + 3_600_000 }

    if (record.activeSessionCount > this.ORG_MAX_SESSIONS) return { allowed: false, reason: 'org_session_limit' }
    if (record.unitsThisMinute + costUnits > this.ORG_BUDGET_PER_MINUTE) {
      return { allowed: false, reason: 'org_minute_budget', retryAfterSeconds: Math.ceil((record.minuteResetAt - now) / 1000) }
    }
    if (record.unitsThisHour + costUnits > this.ORG_BUDGET_PER_HOUR) {
      return { allowed: false, reason: 'org_hour_budget', retryAfterSeconds: Math.ceil((record.hourResetAt - now) / 1000) }
    }

    record.unitsThisMinute += costUnits
    record.unitsThisHour += costUnits
    return { allowed: true }
  }

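  // Emergency kill switch for an organization.
  // Saturates both budgets and pushes both resets 24 hours out, so check() refuses every call until then.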
  blockOrg(orgId: string, reason: string): void {
    const record = this.getOrCreateRecord(orgId)
    record.unitsThisMinute = this.ORG_BUDGET_PER_MINUTE * 10
    record.minuteResetAt = Date.now() + 24 * 60 * 60 * 1000
    record.unitsThisHour = this.ORG_BUDGET_PER_HOUR * 10
    record.hourResetAt = Date.now() + 24 * 60 * 60 * 1000
    auditLog.warn('org_blocked', { orgId, reason, durationHours: 24 })
  }

  private getOrCreateRecord(orgId: string): OrgUsageRecord {
    if (!this.orgUsage.has(orgId)) {
      const now = Date.now()
      this.orgUsage.set(orgId, { orgId, unitsThisMinute: 0, unitsThisHour: 0, activeSessionCount: 0, minuteResetAt: now + 60_000, hourResetAt: now + 3_600_000 })
    }
    return this.orgUsage.get(orgId)!
  }
}
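
Because both the session limit and the org limit must pass, the order of operations matters: if the checks run one after another and each charges its budget on success, a call refused by the org limiter will already have been charged to the session. One way to avoid this is to check every budget first and charge only when all of them can afford the call. The sketch below illustrates the idea with assumed names (BudgetWindow, chargeAll) and an injected clock; it is not the limiter classes shown above.

```typescript
// Sketch: BudgetWindow and chargeAll are hypothetical names for illustration.
class BudgetWindow {
  private used = 0
  private resetAtMs: number
  private readonly limit: number
  private readonly windowMs: number

  constructor(limit: number, windowMs: number, nowMs: number) {
    this.limit = limit
    this.windowMs = windowMs
    this.resetAtMs = nowMs + windowMs
  }

  // Reports whether `cost` fits in the current window without recording any spend.
  canAfford(cost: number, nowMs: number): boolean {
    this.rollOver(nowMs)
    return this.used + cost <= this.limit
  }

  // Records the spend; callers are expected to call canAfford first.
  charge(cost: number, nowMs: number): void {
    this.rollOver(nowMs)
    this.used += cost
  }

  remaining(nowMs: number): number {
    this.rollOver(nowMs)
    return this.limit - this.used
  }

  // Starts a fresh window once the current one has expired.
  private rollOver(nowMs: number): void {
    if (nowMs > this.resetAtMs) { this.used = 0; this.resetAtMs = nowMs + this.windowMs }
  }
}

// Charges every window only if all of them can afford the cost.
function chargeAll(windows: BudgetWindow[], cost: number, nowMs: number): boolean {
  if (!windows.every(w => w.canAfford(cost, nowMs))) return false
  windows.forEach(w => w.charge(cost, nowMs))
  return true
}
```

In a multi-process deployment the check and the charge would need to be atomic in the shared store, which this in-memory sketch does not address.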

SkillAudit checks for rate limiting security

No per-session rate limiting: Tool handlers that make downstream API calls without a session-level rate limit — an LLM-driven agentic loop can exhaust downstream API quotas in seconds
Rate limiting at HTTP layer only: Server returns HTTP 429 rather than a structured tool result — LLM frameworks may not honor HTTP-level rate limits in agentic contexts, leading to immediate retry floods
No loop detection: No mechanism to detect and break repeated identical tool calls in rapid succession — prompt injection loops will run unbounded
Uniform cost per call: Rate limiter treats a bulk operation identically to a single-resource read — cost-weighted throttling is required to account for actual downstream impact
No organization-level limits: Per-session limits only — no aggregate cap across sessions from the same organization or user
