Topic: mcp server rate limiting

MCP server rate limiting — why unbounded tool calls waste money and burn agent context

A human browsing the web notices when a page takes 30 seconds to load and hits the back button. An agent running an MCP tool doesn't. It will wait for the response, receive a 200KB HTML dump, stuff it into context, and keep going — or keep calling the same tool in a loop until it's told to stop. 12% of the 101-server SkillAudit corpus has at least one unbounded-consumption pattern: a tool that paginates without a cap, fetches without a size limit, or calls upstream APIs without a concurrency budget. The blast radius ranges from a surprising API invoice to a fully-occupied context window that renders the agent useless for the rest of the session.

TL;DR

MCP servers can perform unbounded work: pagination without a cap, file fetches without a size limit, recursive tool calls, and upstream API calls without concurrency controls. In agent contexts this is worse than in human-driven UIs: agents call tools in loops without human intervention, no one reads the warning before the bill arrives, and a 200KB response fills the context window silently. There are two distinct failure modes: (1) wallet-burning on metered APIs and (2) context-window DoS. The fixes are straightforward: response size caps, per-call timeouts, explicit pagination bounds, and per-session concurrency limits. SkillAudit's security axis statically flags missing size caps, missing timeouts, and uncapped pagination loops, along with agent-controlled batch sizes that reach a downstream API without a server-enforced cap. This class maps to API4:2023 (Unrestricted Resource Consumption) in the OWASP API Security Top 10.

How MCP servers can perform unbounded work

Most unbounded-consumption patterns in MCP servers aren't bugs in the traditional sense — they're design choices that made sense for a human-facing context and become problems in an agent context. Four common shapes:

Pagination without a cap

A tool that lists GitHub issues, search results, or database rows typically supports pagination. A server that fetches "all pages" to give the agent a complete answer can issue hundreds of sequential HTTP requests. For a large repository with 10,000 issues, that's 100+ API calls on a single tool invocation — burning API quota, taking minutes to complete, and returning far more data than the agent's context window can hold anyway.

File fetches without a size limit

A fetch tool that returns the full body of whatever URL it fetches will happily return a 50MB JavaScript bundle, a 200MB XML sitemap, or a multi-GB binary if the agent asks for a URL that resolves to one. The response goes into tool output, which goes into the agent's context, which then fills up and either truncates silently or causes an error that halts the session.

Recursive calls and callback loops

Some tool designs allow the result of one tool call to trigger another — a "crawl" tool that fetches a URL and then fetches all linked URLs, or a "process" tool that chunks a large input and calls itself recursively. Without a depth limit and a total-work budget, these patterns can execute indefinitely.
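One way to bound a crawl-style tool is to track both hop depth and a total page budget. The sketch below is illustrative only: `boundedCrawl`, `LinkFetcher`, and `CrawlLimits` are hypothetical names, and the link-fetching function is injected so the traversal logic can be exercised without a network.

```typescript
// Hypothetical link fetcher: given a URL, resolves to the URLs that page links to.
type LinkFetcher = (url: string) => Promise<string[]>;

interface CrawlLimits {
  maxDepth: number; // how many link hops from the start URL are allowed
  maxPages: number; // total-work budget: hard cap on pages fetched overall
}

interface CrawlResult {
  visited: string[];
  stoppedEarly: boolean; // true if either limit cut the crawl short
}

async function boundedCrawl(
  startUrl: string,
  fetchLinks: LinkFetcher,
  limits: CrawlLimits,
): Promise<CrawlResult> {
  const visited = new Set<string>();
  // Breadth-first queue of [url, depth] pairs. Iterating instead of recursing
  // keeps the call stack flat no matter how deep the link graph is.
  const queue: Array<[string, number]> = [[startUrl, 0]];
  let stoppedEarly = false;

  while (queue.length > 0) {
    const [url, depth] = queue.shift()!;
    if (visited.has(url)) continue;
    if (visited.size >= limits.maxPages) {
      stoppedEarly = true; // budget exhausted while work was still queued
      break;
    }
    visited.add(url);

    const links = await fetchLinks(url);
    for (const link of links) {
      if (visited.has(link)) continue;
      if (depth + 1 > limits.maxDepth) {
        stoppedEarly = true; // the link exists but lies beyond the depth limit
        continue;
      }
      queue.push([link, depth + 1]);
    }
  }
  return { visited: Array.from(visited), stoppedEarly };
}
```

Returning `stoppedEarly` lets the tool tell the agent that the result is partial, so the agent can decide whether a narrower follow-up call is worth making.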

Upstream API calls without a concurrency budget

A server that calls OpenAI, Anthropic, a paid search API, or any other metered service inside a tool handler spends real money on every invocation. An agent that calls the tool in a loop, whether because the task requires it or because a prompt injection caused it, can generate a surprising invoice before any human notices. A public web service usually sits behind a rate-limiting gateway; there is no equivalent gate between the agent and the MCP server, so the agent can call the tool as fast as the server handles requests.
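A server can supply that missing gate itself. The following is a minimal sketch of a per-session budget that combines an in-flight concurrency cap with a sliding-window call limit. The names `SessionBudget`, `BudgetOptions`, and `budgetFor`, the `sessionId` key, and the default numbers are all assumptions made for illustration; how a session is identified depends on the transport and SDK in use.

```typescript
interface BudgetOptions {
  maxConcurrent: number;     // upstream calls allowed in flight at once
  maxCallsPerWindow: number; // upstream calls allowed per sliding window
  windowMs: number;          // window length in milliseconds
  now?: () => number;        // injectable clock, mainly useful in tests
}

class SessionBudget {
  private inFlight = 0;
  private callTimes: number[] = [];
  private readonly opts: BudgetOptions;
  private readonly now: () => number;

  constructor(opts: BudgetOptions) {
    this.opts = opts;
    this.now = opts.now ?? (() => Date.now());
  }

  // Runs fn only if the budget allows it; otherwise rejects without calling upstream.
  async run<T>(fn: () => Promise<T>): Promise<T> {
    const t = this.now();
    // Forget calls that have aged out of the sliding window.
    this.callTimes = this.callTimes.filter((ts) => t - ts < this.opts.windowMs);
    if (this.inFlight >= this.opts.maxConcurrent) {
      throw new Error("concurrency limit reached");
    }
    if (this.callTimes.length >= this.opts.maxCallsPerWindow) {
      throw new Error("rate limit reached");
    }
    this.inFlight++;
    this.callTimes.push(t);
    try {
      return await fn();
    } finally {
      this.inFlight--;
    }
  }
}

// One budget per session, keyed by a hypothetical session identifier.
const budgets = new Map<string, SessionBudget>();

function budgetFor(sessionId: string): SessionBudget {
  let budget = budgets.get(sessionId);
  if (!budget) {
    budget = new SessionBudget({ maxConcurrent: 2, maxCallsPerWindow: 30, windowMs: 60_000 });
    budgets.set(sessionId, budget);
  }
  return budget;
}
```

A handler would wrap its metered call in `budgetFor(sessionId).run(...)` and turn a rejection into a short error result, which gives a looping agent a clear signal to back off instead of a growing invoice.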

Why agent contexts amplify the risk

In a traditional web application, unbounded resource consumption (OWASP API4) is bounded in practice by human patience: a user who initiates a request that takes 90 seconds and returns 50MB will notice, and won't repeat it. The agent equivalent has no such natural brake.

Agents call tools based on instructions and reasoning, not on observed latency or cost. An agent that's been told to "summarize all open issues in the repository" will paginate through all 10,000 issues if the tool lets it. An agent running in an automated pipeline has no human in the loop to notice that the last tool call consumed $40 of API credits. And the context-window failure mode is particularly insidious: a large response can fill available context silently, causing subsequent tool calls to produce degraded outputs or errors, and the agent may attempt to work around the failure in ways that generate more tool calls.

The result is that unbounded-consumption patterns that are merely annoying in a human-facing tool can be financially or operationally damaging in an agent context.

Two failure modes in detail

Failure mode 1 — wallet-burning on metered upstream APIs

Consider an MCP server that wraps a paid image generation API and accepts a batch_size argument. If the handler passes the agent-provided batch_size directly to the API without a cap, the agent can request 1,000 images in a single tool call. At $0.04 per image, that's $40 on one invocation. An agent in an automation loop that calls the tool hourly makes about 720 calls in a 30-day month, which adds up to a $28,800 invoice.

This pattern also applies to token-metered LLM calls, paid search APIs (Google, Bing, Serper), web scraping services, and any other per-call or per-unit-billed service. The server should enforce its own maximum — independent of what the agent requests — and document that maximum in the tool description so the agent can plan accordingly.
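The arithmetic and the server-side cap can be sketched in a few lines. The $0.04 price comes from the example above; `MAX_BATCH_SIZE`, `clampBatchSize`, `monthlyCostUsd`, and the tool description string are hypothetical and chosen only for illustration. Prices are kept in integer cents to avoid floating-point drift.

```typescript
const PRICE_PER_IMAGE_CENTS = 4; // $0.04 per image, as in the example above
const MAX_BATCH_SIZE = 4;        // hypothetical server-enforced ceiling

// Clamp whatever the agent asked for into the range [1, MAX_BATCH_SIZE].
function clampBatchSize(requested: number): number {
  if (!Number.isFinite(requested)) return 1;
  return Math.min(Math.max(Math.trunc(requested), 1), MAX_BATCH_SIZE);
}

// Worst-case spend if every call in the period uses the given batch size.
function monthlyCostUsd(batchSize: number, callsPerDay: number, days: number = 30): number {
  return (batchSize * PRICE_PER_IMAGE_CENTS * callsPerDay * days) / 100;
}

// Hypothetical tool description that states the cap so the agent can plan around it.
const GENERATE_IMAGES_DESCRIPTION =
  `Generate images from a prompt. batch_size is capped at ${MAX_BATCH_SIZE} per call.`;

console.log(monthlyCostUsd(1000, 24));                 // uncapped, hourly: 28800
console.log(monthlyCostUsd(clampBatchSize(1000), 24)); // capped, hourly: 115.2
```

Clamping silently is one design choice; rejecting out-of-range values in the input schema (for example with `z.number().int().min(1).max(MAX_BATCH_SIZE)`) is another, and it makes the limit visible to the agent as a validation error.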

Failure mode 2 — context-window DoS

A tool that returns a very large response (raw HTML from a documentation site, a full JSON dump of a large API response, the complete content of a code repository) can consume a significant fraction of the agent's context window on a single tool call. Claude 3.5 Sonnet has a 200K-token context window; at roughly 4 bytes per token, a 200KB text response is about 50K tokens, a quarter of the available context, returned in one shot.

When the context fills, the agent cannot fit its remaining instructions, prior conversation, and all tool outputs simultaneously. Clients handle this differently — some truncate silently, some return errors, some summarize automatically. None of these outcomes is what the user intended, and all of them are preventable by capping the tool's response size.
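A response cap can be applied to any tool's text output, not just to fetchers. The helper below is a sketch under stated assumptions: `estimateTokens` and `capText` are hypothetical names, and the 4-bytes-per-token ratio is only the rough heuristic behind the "200KB is about 50K tokens" estimate, not a real tokenizer.

```typescript
const BYTES_PER_TOKEN = 4; // rough heuristic, not a real tokenizer

// Back-of-envelope token estimate for a payload of the given size in bytes.
function estimateTokens(byteLength: number): number {
  return Math.ceil(byteLength / BYTES_PER_TOKEN);
}

// Truncate text to at most maxBytes of UTF-8 and mark the truncation for the agent.
function capText(text: string, maxBytes: number): { text: string; truncated: boolean } {
  const bytes = new TextEncoder().encode(text);
  if (bytes.byteLength <= maxBytes) return { text, truncated: false };
  // Cutting mid-character makes the decoder emit U+FFFD at the end; strip it
  // so the capped output does not end in a replacement character.
  const head = new TextDecoder().decode(bytes.subarray(0, maxBytes)).replace(/\uFFFD+$/, "");
  return { text: head + "\n[truncated]", truncated: true };
}

console.log(estimateTokens(200_000)); // 50000
console.log(estimateTokens(32_000));  // 8000
```

The explicit `[truncated]` marker matters as much as the cap itself: it tells the agent the output is partial, rather than leaving it to reason over content that ends abruptly.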

Bounded vs. unbounded fetchers — code examples

// UNBOUNDED — returns whatever the server sends, no limits
server.tool('fetch_url', { url: z.string() }, async ({ url }) => {
  const res = await fetch(url);
  const text = await res.text(); // could be 200MB
  return { content: [{ type: 'text', text }] };
});

// UNBOUNDED — paginates through all results with no cap
server.tool('list_issues', { repo: z.string() }, async ({ repo }) => {
  let page = 1, all = [];
  while (true) {
    const batch = await github.issues.list({ repo, page, per_page: 100 });
    if (batch.length === 0) break;
    all.push(...batch);
    page++;
  }
  return { content: [{ type: 'text', text: JSON.stringify(all) }] };
});

// BOUNDED — size cap, timeout, explicit pagination limit
const MAX_BYTES = 32_000;   // ~8K tokens
const TIMEOUT_MS = 10_000;  // 10 seconds
const MAX_PAGES = 5;        // at most 500 items

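server.tool('fetch_url', { url: z.string().url() }, async ({ url }) => {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), TIMEOUT_MS);
  try {
    const res = await fetch(url, { signal: controller.signal });
    // Stream the body and stop at MAX_BYTES, so an oversized response
    // is never fully downloaded or buffered in memory.
    const buffer = new Uint8Array(MAX_BYTES);
    let received = 0, truncated = false;
    const reader = res.body?.getReader();
    while (reader && !truncated) {
      const { done, value } = await reader.read();
      if (done) break;
      const room = MAX_BYTES - received;
      buffer.set(value.subarray(0, room), received);
      received += Math.min(value.byteLength, room);
      truncated = value.byteLength > room;
    }
    if (truncated) await reader?.cancel(); // stop downloading the remainder
    const text = new TextDecoder().decode(buffer.subarray(0, received));
    return { content: [{ type: 'text', text: truncated ? text + '\n[truncated]' : text }] };
  } finally {
    clearTimeout(timer); // the timeout covers both the headers and the body read
  }
});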

server.tool('list_issues',
  { repo: z.string(), page: z.number().int().min(1).max(MAX_PAGES).default(1) },
  async ({ repo, page }) => {
    const items = await github.issues.list({ repo, page, per_page: 100 });
    return { content: [{ type: 'text', text: JSON.stringify({ items, page, has_more: items.length === 100 }) }] };
  }
);

The bounded versions make the limits explicit, surface pagination state to the agent (so it can decide whether to fetch page 2), and won't run indefinitely regardless of what the agent requests. The 32KB response cap is not arbitrary — it's calibrated to return useful content while leaving the majority of the context window available for the rest of the session.

What SkillAudit checks for

The security axis includes a check for unrestricted resource consumption, aligned with OWASP API4. SkillAudit's static analysis flags four patterns in tool handler code:

Unbounded response reads: res.text(), res.json(), or response.read() without a preceding size check or a Content-Length guard.
Unconditional pagination loops: while(true) or do/while loops that call an external API inside a tool handler with no iteration cap.
Agent-controlled batch sizes: tool arguments named limit, count, batch_size, or similar that are passed to downstream APIs without a server-enforced cap.
Missing timeout on outbound requests: fetch(), axios.get(), httpx.get(), and similar calls without a signal, timeout, or AbortController attached.

Each finding includes the handler name, the file and line of the problematic pattern, and a recommended fix. The 12% corpus rate for this class means roughly 1 in 8 servers you're likely to encounter has at least one of these patterns unfixed.
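To make the flagged shapes concrete, here is a toy line-based scanner for three of them. It is not SkillAudit's implementation: the rule names, the regular expressions, and the `scan` function are hypothetical, and a regex pass like this over-reports (for example, it flags a `.json()` call even when a size check precedes it) and misses anything spread across lines.

```typescript
interface Finding {
  line: number; // 1-based line number within the scanned source
  rule: string;
}

// Hypothetical rules, each a rough textual approximation of one pattern.
const RULES: Array<{ rule: string; pattern: RegExp }> = [
  { rule: "unconditional-loop", pattern: /while\s*\(\s*true\s*\)/ },
  { rule: "unbounded-response-read", pattern: /\.(text|json)\(\s*\)/ },
  // A single-argument fetch(...) call cannot carry a signal or timeout option.
  { rule: "request-without-timeout", pattern: /\bfetch\(\s*[^,()]+\)/ },
];

function scan(source: string): Finding[] {
  const findings: Finding[] = [];
  source.split("\n").forEach((text, index) => {
    for (const { rule, pattern } of RULES) {
      if (pattern.test(text)) findings.push({ line: index + 1, rule });
    }
  });
  return findings;
}

const sample = [
  "const res = await fetch(url);",
  "const text = await res.text();",
  "while (true) { page++; }",
  "const ok = await fetch(url, { signal: controller.signal });",
].join("\n");

// Reports lines 1, 2, and 3; line 4 passes because fetch receives an options argument.
console.log(scan(sample));
```

A production check would work on a parsed syntax tree and follow data flow from tool arguments to downstream calls, which is what catching agent-controlled batch sizes requires.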
