MCP server rate limit bypass — preventing resource exhaustion and API cost abuse

Most MCP servers have no rate limiting at the tool-call level. For a stdio server used by one developer on one machine, that's acceptable — the LLM's own token budget limits how many tool calls can happen per session. For an HTTP-transport server accessed by multiple clients, or any server that proxies to paid external APIs, the absence of rate limits creates three distinct risks: resource exhaustion from prompt-injection-driven tool flooding, API cost abuse where the server owner pays for attacker-generated traffic, and compute DoS that makes the server unavailable to legitimate callers. This page covers the four bypass patterns and the per-caller sliding-window implementation that closes them.

The four rate limit bypass patterns

Pattern 1: IP-level rate limiting without per-user limits

The simplest rate limit implementation counts requests per IP address. This limits an attacker on a dedicated IP but breaks down in typical MCP deployments: multiple agent clients behind a NAT share one IP and therefore one bucket; a single LLM making tool calls in a loop arrives from that same IP and can exhaust the quota for every client sharing it; and an HTTP-transport server behind a reverse proxy sees only the proxy's IP unless forwarded-address handling is configured.

// WRONG: IP-only rate limit — bypassed by NAT, proxy, and single-client floods
import rateLimit from 'express-rate-limit';

const limiter = rateLimit({
  windowMs: 60_000,
  max: 100,
  keyGenerator: (req) => req.ip  // all clients behind NAT share one bucket
});

// CORRECT: rate limit by authenticated caller identity, not IP
const limiter = rateLimit({
  windowMs: 60_000,
  max: 60,  // 60 tool calls per minute per caller
  keyGenerator: (req) => {
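    // Use the authenticated user or a validated API key as the rate limit key.
    // An unvalidated header must not be trusted here: rotating its value would mint fresh buckets.
    return req.user?.id || req.headers['x-api-key'] || req.ip;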
  },
  handler: (req, res) => {
    res.status(429).json({
      error: 'rate_limit_exceeded',
      message: 'Tool call rate limit reached. Retry-After header indicates when limit resets.',
      limit: 60,
      windowMs: 60_000
    });
  },
  standardHeaders: true,  // sends RateLimit-* headers (IETF draft); RFC 6585 only defines the 429 status
  legacyHeaders: false,
});
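
The fallback chain above mixes three kinds of identifier in one key space. A minimal sketch of a key function that namespaces each kind and only accepts identities the auth layer has already verified might look like this. `CallerContext`, `verifiedUserId`, `verifiedApiKeyId`, and `rateLimitKey` are hypothetical names for illustration, not part of express-rate-limit.

```typescript
// Hypothetical shape: the auth middleware is assumed to populate these
// fields only after the credential has been verified.
interface CallerContext {
  verifiedUserId?: string;
  verifiedApiKeyId?: string;
  ip: string;
}

// Builds a namespaced rate limit key so that a user id, an API key id,
// and an IP address can never collide in the same bucket.
function rateLimitKey(ctx: CallerContext): string {
  if (ctx.verifiedUserId) return `user:${ctx.verifiedUserId}`;
  if (ctx.verifiedApiKeyId) return `key:${ctx.verifiedApiKeyId}`;
  // Unauthenticated traffic falls back to a shared per-IP bucket.
  return `ip:${ctx.ip}`;
}
```

The returned string could be used as the return value of `keyGenerator`. Unauthenticated callers still share a per-IP bucket, which is usually acceptable as long as that bucket has a lower limit than authenticated ones.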

Pattern 2: fixed window with boundary-exploit

A fixed window rate limit (e.g., 60 requests per minute, reset at :00 each minute) allows up to 120 requests in a 2-second window straddling the boundary: 60 requests at 11:59:59 and 60 requests at 12:00:01. A burst attacker can hit 2× the nominal limit by timing requests around the window boundary.
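
The boundary arithmetic can be reproduced with a small deterministic simulation. This is a sketch of a naive fixed-window counter, not code from any particular library; `FixedWindowCounter` and `burst` are illustrative names, and time is passed in explicitly so the result does not depend on the wall clock.

```typescript
// Fixed-window counter: the count resets whenever the clock crosses
// a multiple of windowMs.
class FixedWindowCounter {
  private windowIndex = -1;
  private count = 0;
  private limit: number;
  private windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  isAllowed(nowMs: number): boolean {
    const index = Math.floor(nowMs / this.windowMs);
    if (index !== this.windowIndex) {
      // A new window has started: forget everything from the previous one.
      this.windowIndex = index;
      this.count = 0;
    }
    if (this.count >= this.limit) return false;
    this.count += 1;
    return true;
  }
}

// Sends n requests at the same instant and returns how many were accepted.
function burst(counter: FixedWindowCounter, nowMs: number, n: number): number {
  let accepted = 0;
  for (let i = 0; i < n; i++) {
    if (counter.isAllowed(nowMs)) accepted++;
  }
  return accepted;
}

const fixed = new FixedWindowCounter(60, 60_000);
const beforeBoundary = burst(fixed, 59_000, 60); // one second before the reset
const afterBoundary = burst(fixed, 61_000, 60);  // one second after the reset
console.log(beforeBoundary + afterBoundary);      // 120 accepted within 2 seconds
```

Both bursts are accepted in full because the counter has no memory of the previous window once the boundary is crossed.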

// Sliding window rate limit — eliminates the boundary exploit
class SlidingWindowRateLimiter {
  private windows = new Map<string, number[]>();

  isAllowed(key: string, limit: number, windowMs: number): boolean {
    const now = Date.now();
    const windowStart = now - windowMs;

    if (!this.windows.has(key)) this.windows.set(key, []);
    const timestamps = this.windows.get(key)!;

    // Remove timestamps outside the window:
    const recent = timestamps.filter(t => t > windowStart);
    this.windows.set(key, recent);

    if (recent.length >= limit) return false;

    recent.push(now);
    return true;
  }

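  getRetryAfter(key: string, windowMs: number): number {
    const timestamps = this.windows.get(key) ?? [];
    if (!timestamps.length) return 0;
    // Timestamps are appended in order, so the first entry is the oldest.
    const oldest = timestamps[0];
    // Clamp at 0 in case the oldest entry has already aged out of the window.
    return Math.max(0, Math.ceil((oldest + windowMs - Date.now()) / 1000));
  }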
}

const rateLimiter = new SlidingWindowRateLimiter();

server.setRequestHandler(CallToolRequestSchema, async (request) => {
  const callerId = getCallerId(request);  // from auth context
  if (!rateLimiter.isAllowed(callerId, 60, 60_000)) {
    const retryAfter = rateLimiter.getRetryAfter(callerId, 60_000);
    throw new McpError(
      ErrorCode.InvalidRequest,
      `Rate limit exceeded. Retry after ${retryAfter} seconds.`
    );
  }
  // ... tool handling
});

Pattern 3: global rate limit without per-tool limits

A global rate limit that counts all tool calls equally lets an attacker spend the entire allowance on the single most expensive tool. If search_web costs $0.02/call and get_server_status costs nothing, a shared 60-calls/minute limit still permits $1.20/minute (about $72/hour) in search API costs from a single client.

// Per-tool rate limits with different thresholds for expensive tools
const TOOL_LIMITS: Record<string, { limit: number; windowMs: number }> = {
  'search_web':         { limit: 10,  windowMs: 60_000 },  // 10/min — costs money
  'run_code':           { limit: 5,   windowMs: 60_000 },  // 5/min — compute intensive
  'read_file':          { limit: 100, windowMs: 60_000 },  // 100/min — cheap
  'get_server_status':  { limit: 300, windowMs: 60_000 },  // 300/min — trivial
};

const DEFAULT_LIMIT = { limit: 60, windowMs: 60_000 };

server.setRequestHandler(CallToolRequestSchema, async (request) => {
  const toolName = request.params.name;
  const { limit, windowMs } = TOOL_LIMITS[toolName] ?? DEFAULT_LIMIT;
  const callerId = getCallerId(request);
  const key = `${callerId}:${toolName}`;

  if (!rateLimiter.isAllowed(key, limit, windowMs)) {
    throw new McpError(
      ErrorCode.InvalidRequest,
      `Rate limit for tool '${toolName}' exceeded (${limit}/${windowMs}ms).`
    );
  }
  // ...
});
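
To see what the per-tool thresholds buy, the worst-case spend per caller can be computed directly from a limit and a per-call price. The sketch below uses the $0.02/call search_web price from the example above; `ToolLimit` and `maxHourlyCostUsd` are hypothetical helper names, and the bound assumes the limiter never admits more than `limit` calls in any `windowMs` interval.

```typescript
interface ToolLimit {
  limit: number;
  windowMs: number;
}

// Upper bound on spend per caller per hour for one tool.
function maxHourlyCostUsd(toolLimit: ToolLimit, costPerCallUsd: number): number {
  const windowsPerHour = 3_600_000 / toolLimit.windowMs;
  return toolLimit.limit * windowsPerHour * costPerCallUsd;
}

// Shared 60/min limit versus a dedicated 10/min limit for search_web.
const globalOnly = maxHourlyCostUsd({ limit: 60, windowMs: 60_000 }, 0.02);
const perTool = maxHourlyCostUsd({ limit: 10, windowMs: 60_000 }, 0.02);

console.log(globalOnly.toFixed(2)); // "72.00"
console.log(perTool.toFixed(2));    // "12.00"
```

Running the same calculation over every entry in a limits table gives a per-caller cost ceiling that can be compared against what the server owner is willing to pay.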

Pattern 4: missing backpressure on concurrent calls

Rate limits count calls over time. Concurrency limits cap how many calls run simultaneously. Both are needed: without a concurrency limit, a client can issue its full 60-call allowance within a single second, with every call running in parallel and saturating the server's outbound connection pool or the external API's concurrent request limit.

import PQueue from 'p-queue';

// Per-caller concurrency queue — maximum 3 concurrent calls per caller
const callerQueues = new Map<string, PQueue>();

function getCallerQueue(callerId: string): PQueue {
  if (!callerQueues.has(callerId)) {
    callerQueues.set(callerId, new PQueue({ concurrency: 3 }));
  }
  return callerQueues.get(callerId)!;
}

// Global concurrency limit across all callers:
const globalQueue = new PQueue({ concurrency: 20 });

server.setRequestHandler(CallToolRequestSchema, async (request) => {
  const callerId = getCallerId(request);
  const queue = getCallerQueue(callerId);

  // Reject immediately if 10 or more calls are already waiting for this caller
  // (queue.size counts queued calls only, not the ones currently running):
  if (queue.size >= 10) {
    throw new McpError(
      ErrorCode.InvalidRequest,
      'Too many concurrent requests. Wait for pending calls to complete.'
    );
  }

  return queue.add(() => globalQueue.add(() => handleToolCall(request)));
});
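
For servers that would rather reject excess calls than queue them, the same cap can be sketched without a queue library. This is an illustrative alternative under stated assumptions, not a replacement for p-queue: `InFlightLimiter` is a hypothetical class that counts in-flight calls per caller and drops a caller's entry once it reaches zero, so the map does not grow with every caller ever seen.

```typescript
// Tracks in-flight calls per caller and rejects, rather than queues,
// once a caller is at its cap.
class InFlightLimiter {
  private inFlight = new Map<string, number>();
  private maxPerCaller: number;

  constructor(maxPerCaller: number) {
    this.maxPerCaller = maxPerCaller;
  }

  // Returns true and reserves a slot, or false if the caller is at its cap.
  tryAcquire(callerId: string): boolean {
    const current = this.inFlight.get(callerId) ?? 0;
    if (current >= this.maxPerCaller) return false;
    this.inFlight.set(callerId, current + 1);
    return true;
  }

  // Frees a slot; empty entries are deleted so idle callers use no memory.
  release(callerId: string): void {
    const current = this.inFlight.get(callerId) ?? 0;
    if (current <= 1) this.inFlight.delete(callerId);
    else this.inFlight.set(callerId, current - 1);
  }

  // Runs task inside a slot and always releases it, even if task throws.
  async run<T>(callerId: string, task: () => Promise<T>): Promise<T> {
    if (!this.tryAcquire(callerId)) {
      throw new Error("Too many concurrent requests for this caller.");
    }
    try {
      return await task();
    } finally {
      this.release(callerId);
    }
  }
}
```

Inside a tool handler this would be used as `inFlightLimiter.run(callerId, () => handleToolCall(request))`, with the thrown error mapped to whatever error type the server returns for throttled calls.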

Prompt-injection-driven flooding: the MCP-specific threat

The rate limit bypass patterns above exist in any API server. What is unique to MCP servers is the prompt injection vector. A malicious document processed by the agent — a README with hidden HTML comment injection, a GitHub issue with invisible Unicode, a PDF with injected system-prompt text — can cause the LLM to make tool calls in a loop without any human authorization.

Per-caller sliding-window limits are the right defense because they cap the blast radius per session: even if prompt injection drives the agent to attempt 1,000 search_web calls, the 10/minute per-tool limit from Pattern 3 lets only 10 of them execute per minute per session. The loop still runs, but the damage is bounded. Without rate limits, the loop runs until the context window is exhausted and every search call executes, potentially running up hundreds of dollars in API costs before the user notices.
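
The bound can be checked with a deterministic simulation of an injected loop. This sketch reuses the sliding-window idea from Pattern 2 but takes the current time as a parameter; `SimulatedSlidingWindow` is an illustrative name, and the pace of one attempt every 50 ms is an assumption chosen so that all 1,000 attempts fall within one minute.

```typescript
// Sliding-window limiter for a single key, with the clock passed in
// so the simulation does not depend on real time.
class SimulatedSlidingWindow {
  private timestamps: number[] = [];
  private limit: number;
  private windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  isAllowed(nowMs: number): boolean {
    const windowStart = nowMs - this.windowMs;
    // Keep only the calls that are still inside the window.
    this.timestamps = this.timestamps.filter((t) => t > windowStart);
    if (this.timestamps.length >= this.limit) return false;
    this.timestamps.push(nowMs);
    return true;
  }
}

// Hypothetical injected loop: 1,000 search_web attempts, one every 50 ms.
const searchLimiter = new SimulatedSlidingWindow(10, 60_000);
let executed = 0;
for (let i = 0; i < 1000; i++) {
  if (searchLimiter.isAllowed(i * 50)) executed++;
}
console.log(executed); // 10
```

At the $0.02/call price used earlier, that is $0.20 of search traffic in the first minute instead of $20 for all 1,000 calls.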

What SkillAudit checks

The security axis flags rate limiting gaps in HTTP-transport MCP servers. Specifically:

stdio-transport single-user servers receive PASS for this check — the user's session context provides natural rate limiting.
