MCP server rate limiting and quota enforcement — per-session quotas, token bucket burst allowance, and 429 backpressure signaling
An MCP server without rate limiting is an amplifier. A single LLM session running an agentic loop can invoke the same tool thousands of times, exhausting upstream API rate limits, generating unexpected costs, and consuming server resources without bound. Unlike conventional HTTP API rate limiting, MCP rate limiting must account for the call patterns of LLM-driven tool use: bursty, session-scoped, tool-specific, and driven by reasoning that an adversarial prompt can redirect. The five patterns below enforce the right limits at the right granularity.
1. Per-session tool call quota — lifetime cap, not just throughput
Rate limiting by throughput alone (N calls per minute) does not prevent a session that runs for hours from making an unbounded total number of tool calls. An agentic loop that calls a tool once every two minutes can make 720 calls in a 24-hour session: well under any typical per-minute rate limit, but potentially enough to exhaust an upstream API's daily quota. Per-session quotas set a hard lifetime ceiling on tool calls for a given session, regardless of how slowly those calls arrive. The quota should be tracked per session in server memory (or in Redis for multi-process deployments), and calls that exceed it should be rejected with a clear message.
interface SessionQuota {
totalCalls: number
callsByTool: Record<string, number>
sessionStartMs: number
}
const SESSION_LIMITS = {
totalCalls: 500, // Max calls across the entire session lifetime
perTool: 100, // Max calls to any single tool per session
sessionMaxMs: 3_600_000, // Session hard expiry: 1 hour
}
const sessionQuotas = new Map<string, SessionQuota>()
function checkSessionQuota(sessionId: string, toolName: string): void {
const now = Date.now()
const quota = sessionQuotas.get(sessionId) ?? {
totalCalls: 0, callsByTool: {}, sessionStartMs: now,
}
// Check session age. The entry is kept on expiry: deleting it here would let
// the next call with the same sessionId start over with a fresh quota.
if (now - quota.sessionStartMs > SESSION_LIMITS.sessionMaxMs) {
throw new Error('Session quota: session lifetime exceeded (1 hour). Start a new session.')
}
// Check total call limit
if (quota.totalCalls >= SESSION_LIMITS.totalCalls) {
throw new Error(`Session quota: ${SESSION_LIMITS.totalCalls} total tool calls exceeded for this session.`)
}
// Check per-tool limit
const toolCalls = quota.callsByTool[toolName] ?? 0
if (toolCalls >= SESSION_LIMITS.perTool) {
throw new Error(`Session quota: ${SESSION_LIMITS.perTool} calls to '${toolName}' exceeded for this session.`)
}
// Increment counters
quota.totalCalls++
quota.callsByTool[toolName] = toolCalls + 1
sessionQuotas.set(sessionId, quota)
}
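The arithmetic behind the 720-call example can be checked directly. The sketch below is illustrative only: `callsOverSession`, `callsPerMinute`, and both limit constants are hypothetical names and values, chosen to show how a slow loop passes a throughput check while still exceeding a lifetime total.

```typescript
// Total calls made over a session when a tool is called at a fixed interval.
function callsOverSession(sessionHours: number, intervalMinutes: number): number {
  return Math.floor((sessionHours * 60) / intervalMinutes)
}

// Average calls per minute at that interval.
function callsPerMinute(intervalMinutes: number): number {
  return 1 / intervalMinutes
}

const PER_MINUTE_LIMIT = 30      // hypothetical throughput limit (calls/minute)
const UPSTREAM_DAILY_QUOTA = 500 // hypothetical upstream daily quota (calls/day)

const total = callsOverSession(24, 2) // one call every 2 minutes for 24 hours
const rate = callsPerMinute(2)

// The loop never trips the throughput limit...
const passesThroughputLimit = rate <= PER_MINUTE_LIMIT
// ...but its lifetime total still exceeds the daily quota.
const exceedsDailyQuota = total > UPSTREAM_DAILY_QUOTA
```

Only a lifetime counter such as `totalCalls` catches this case, which is why the per-session quota complements a throughput limit rather than replacing it.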
2. Per-tool time budget to limit expensive tool consumption
Tool calls vary enormously in cost: a simple data lookup may take 20 milliseconds, while a code execution tool might run for 30 seconds and consume GPU or CPU resources proportionally. A session that exclusively calls the expensive tool would not trigger a call-count quota quickly, but would consume vastly disproportionate resources. Per-tool time budgets track the aggregate wall-clock time spent on each tool type across a session. When the budget is exhausted, further calls to that tool are rejected. This prevents any single tool from being abused as a resource-exhaustion vector while leaving cheaper tools available.
// Time budgets per tool (milliseconds per session)
const TOOL_TIME_BUDGETS_MS: Record<string, number> = {
'execute_code': 120_000, // 2 minutes aggregate per session
'run_search': 60_000, // 1 minute aggregate
'fetch_url': 30_000, // 30 seconds aggregate
// Default for unlisted tools: 60 seconds
}
const DEFAULT_BUDGET_MS = 60_000
const sessionTimeBudgets = new Map<string, Map<string, number>>()
async function withTimeBudget<T>(
sessionId: string,
toolName: string,
fn: () => Promise<T>
): Promise<T> {
// Register the per-session map up front so concurrent calls share one map
let budgets = sessionTimeBudgets.get(sessionId)
if (!budgets) {
budgets = new Map<string, number>()
sessionTimeBudgets.set(sessionId, budgets)
}
const spent = budgets.get(toolName) ?? 0
const budget = TOOL_TIME_BUDGETS_MS[toolName] ?? DEFAULT_BUDGET_MS
if (spent >= budget) {
const spentSec = (spent / 1000).toFixed(0)
const budgetSec = (budget / 1000).toFixed(0)
throw new Error(
`Time budget exhausted: ${toolName} has used ${spentSec}s of allowed ${budgetSec}s this session.`
)
}
const start = Date.now()
try {
return await fn()
} finally {
// Re-read the running total: other calls may have finished while this one ran
const elapsed = Date.now() - start
budgets.set(toolName, (budgets.get(toolName) ?? 0) + elapsed)
}
}
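A budget that is checked only before a call starts still lets one long call overshoot it. One possible refinement, sketched below, is to cap each call at the budget that remains. `remainingBudgetMs` and `withDeadline` are hypothetical helpers, not part of the code above, and the timeout only stops waiting on the work rather than cancelling it.

```typescript
// Hypothetical helper: how much of the tool's budget is still unspent.
function remainingBudgetMs(spentMs: number, budgetMs: number): number {
  return Math.max(0, budgetMs - spentMs)
}

// Hypothetical helper: rejects if `fn` has not settled within `limitMs`.
// The underlying work is NOT cancelled; real cancellation would need an
// AbortSignal or similar mechanism passed into the tool implementation.
function withDeadline<T>(limitMs: number, fn: () => Promise<T>): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`Tool call exceeded its remaining budget of ${limitMs}ms`)),
      limitMs,
    )
    // Promise.resolve().then(fn) also converts a synchronous throw into a rejection
    Promise.resolve().then(fn).then(
      value => { clearTimeout(timer); resolve(value) },
      err => { clearTimeout(timer); reject(err) },
    )
  })
}
```

Under these assumptions, a call would run as `withDeadline(remainingBudgetMs(spent, budget), fn)`, so a tool with 30 seconds of budget left cannot hold a worker for longer than that.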
3. Sliding window rate limiter to prevent burst exploitation
A fixed-window rate limiter resets its counter at the start of each window (e.g., every 60 seconds). This creates an exploitable boundary: an attacker can make N calls at second 59, watch the window reset, then make N more calls at second 61 — totaling 2N calls in 2 seconds without triggering the limit. A sliding window tracks the timestamps of actual calls within a rolling time window, rather than a count within a fixed period. The window slides with each new call, so the effective rate is always measured against the most recent 60 seconds of actual history, regardless of where the window boundaries fall.
class SlidingWindowRateLimiter {
// Map of key -> array of call timestamps (ms)
private windows = new Map<string, number[]>()
constructor(
private readonly windowMs: number, // e.g., 60_000 for 1-minute window
private readonly maxCalls: number, // e.g., 60 for 60 calls/minute
) {}
check(key: string): { allowed: boolean; retryAfterMs: number } {
const now = Date.now()
const cutoff = now - this.windowMs
const calls = (this.windows.get(key) ?? []).filter(t => t > cutoff)
if (calls.length >= this.maxCalls) {
// Earliest call in window determines when a slot opens up
const oldestCall = Math.min(...calls)
const retryAfterMs = oldestCall + this.windowMs - now
return { allowed: false, retryAfterMs }
}
calls.push(now)
this.windows.set(key, calls)
return { allowed: true, retryAfterMs: 0 }
}
}
// Per-session, per-tool sliding window: 30 calls per minute per tool
const rateLimiter = new SlidingWindowRateLimiter(60_000, 30)
function checkRateLimit(sessionId: string, toolName: string): void {
const key = `${sessionId}:${toolName}`
const { allowed, retryAfterMs } = rateLimiter.check(key)
if (!allowed) {
const retrySec = Math.ceil(retryAfterMs / 1000)
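// Throw the RateLimitError class defined in pattern 4 so the tool handler's
// `instanceof RateLimitError` check recognizes this rejection
throw new RateLimitError({
message: `Rate limit exceeded for ${toolName}.`,
retryAfterSeconds: retrySec,
limitType: 'per_minute',
})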
}
}
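The boundary exploit described above can be reproduced with a small simulation. This is a sketch with explicit timestamps instead of a real clock; `fixedWindowAllowed` and `slidingWindowAllowed` are hypothetical helper names that count how many calls each strategy would admit.

```typescript
// Fixed window: the counter resets whenever a call lands in a new window index.
function fixedWindowAllowed(timestampsMs: number[], windowMs: number, maxCalls: number): number {
  let windowIndex = -1
  let count = 0
  let allowed = 0
  for (const t of timestampsMs) {
    const idx = Math.floor(t / windowMs)
    if (idx !== windowIndex) {
      windowIndex = idx
      count = 0
    }
    if (count < maxCalls) {
      count++
      allowed++
    }
  }
  return allowed
}

// Sliding window: a call is admitted only if fewer than maxCalls admitted
// calls fall within the windowMs immediately before it.
function slidingWindowAllowed(timestampsMs: number[], windowMs: number, maxCalls: number): number {
  let recent: number[] = []
  let allowed = 0
  for (const t of timestampsMs) {
    recent = recent.filter(a => a > t - windowMs)
    if (recent.length < maxCalls) {
      recent.push(t)
      allowed++
    }
  }
  return allowed
}

// 30 calls at second 59, then 30 more at second 61 (limit: 30 per 60 seconds)
const burst = [
  ...Array.from({ length: 30 }, () => 59_000),
  ...Array.from({ length: 30 }, () => 61_000),
]
const fixedCount = fixedWindowAllowed(burst, 60_000, 30)
const slidingCount = slidingWindowAllowed(burst, 60_000, 30)
```

With this input the fixed window admits all 60 calls within two seconds, while the sliding window admits only the first 30.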
4. 429 backpressure with Retry-After for structured LLM signaling
When an MCP server rejects a tool call due to rate limiting, the error message the LLM receives shapes its next action. A generic "Error: rate limit" message with no timing information invites an immediate retry, and repeated immediate retries become a retry storm that keeps the session pinned at the limit. A structured error that includes a specific retry time, modeled on HTTP 429 Too Many Requests with a Retry-After header, tells the LLM when the limit will clear and lets it either wait and retry or inform the user of the delay. This is backpressure: the server communicates its capacity constraints in a way the client can act on.
// Structured rate limit error that the LLM can parse and act on
class RateLimitError extends Error {
readonly code = 429
readonly retryAfterSeconds: number
readonly limitType: 'per_minute' | 'per_session' | 'time_budget'
constructor(opts: {
message: string
retryAfterSeconds: number
limitType: RateLimitError['limitType']
}) {
super(opts.message)
this.name = 'RateLimitError'
this.retryAfterSeconds = opts.retryAfterSeconds
this.limitType = opts.limitType
}
}
// In the MCP tool handler — convert to structured MCP error response
server.tool('search_code', { query: z.string() }, async (args, { sessionId }) => {
try {
checkRateLimit(sessionId, 'search_code')
checkSessionQuota(sessionId, 'search_code')
return await performSearch(args.query)
} catch (err) {
if (err instanceof RateLimitError) {
// Rethrow with a structured message the LLM can parse
throw new Error(
`[RATE_LIMIT] ${err.message} ` +
`Retry in ${err.retryAfterSeconds} seconds. ` +
`Limit type: ${err.limitType}. ` +
`Do not retry until the specified time has passed.`
)
}
throw err
}
})
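An alternative to rethrowing is to return a tool result flagged as an error, since MCP tool results can carry `isError: true` alongside text content. The sketch below assumes that shape and declares it locally rather than importing SDK types; `toRateLimitResult`, `LimitType`, and `ToolErrorResult` are hypothetical names, and whether returning or throwing is preferable may depend on the SDK version in use.

```typescript
type LimitType = 'per_minute' | 'per_session' | 'time_budget'

// Local structural type mirroring an MCP tool result with text content.
// Declared here as an assumption so the sketch stays self-contained.
interface ToolErrorResult {
  content: { type: 'text'; text: string }[]
  isError: true
}

// Builds the same structured message used in the handler above,
// packaged as an error result instead of a thrown Error.
function toRateLimitResult(
  message: string,
  retryAfterSeconds: number,
  limitType: LimitType,
): ToolErrorResult {
  const text =
    `[RATE_LIMIT] ${message} ` +
    `Retry in ${retryAfterSeconds} seconds. ` +
    `Limit type: ${limitType}. ` +
    `Do not retry until the specified time has passed.`
  return { content: [{ type: 'text', text }], isError: true }
}

const example = toRateLimitResult('Rate limit exceeded for search_code.', 12, 'per_minute')
```

Keeping the `[RATE_LIMIT]` prefix and the field order stable gives the model a consistent pattern to recognize across tools.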
5. Token bucket algorithm for burst allowance with average rate enforcement
Strict per-second rate limits penalize legitimate tool use patterns where an LLM naturally calls several tools in quick succession at the start of a task. A token bucket accommodates this burst behavior: tokens accumulate in a bucket at a steady refill rate (e.g., 1 token/second), and each tool call consumes one token. The bucket has a maximum capacity (e.g., 10 tokens), so accumulated tokens cannot grow unbounded. This allows a burst of up to 10 calls at any moment (consuming all accumulated tokens), while the long-run average is constrained to 1 call/second by the refill rate. Legitimate agentic sessions — which burst at task start then slow down — pass cleanly. Runaway loops that call continuously do not.
class TokenBucket {
private tokens: number
private lastRefillMs: number
constructor(
private readonly capacity: number, // Max tokens (burst size)
private readonly refillRatePerSec: number, // Tokens added per second
) {
this.tokens = capacity // Start full
this.lastRefillMs = Date.now()
}
consume(count = 1): { allowed: boolean; retryAfterSeconds: number } {
this.refill()
if (this.tokens < count) {
// Time until enough tokens refill
const needed = count - this.tokens
const retryAfterSeconds = needed / this.refillRatePerSec
return { allowed: false, retryAfterSeconds }
}
this.tokens -= count
return { allowed: true, retryAfterSeconds: 0 }
}
private refill(): void {
const now = Date.now()
const elapsedSec = (now - this.lastRefillMs) / 1000
this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillRatePerSec)
this.lastRefillMs = now
}
}
// Per-session buckets: 10 token burst capacity, 1 token/second refill
const sessionBuckets = new Map<string, TokenBucket>()
function getSessionBucket(sessionId: string): TokenBucket {
if (!sessionBuckets.has(sessionId)) {
// 10-token burst, 1 call/second average
sessionBuckets.set(sessionId, new TokenBucket(10, 1))
}
return sessionBuckets.get(sessionId)!
}
// Use in middleware applied to all tool calls
function checkTokenBucket(sessionId: string, toolName: string): void {
const { allowed, retryAfterSeconds } = getSessionBucket(sessionId).consume()
if (!allowed) {
throw new RateLimitError({
message: `Rate limit: token bucket exhausted for session.`,
retryAfterSeconds: Math.ceil(retryAfterSeconds),
limitType: 'per_minute',
})
}
}
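To see the burst and average-rate behavior side by side, the refill logic can be replayed against explicit call times. This is a simulation sketch, not the class above: `simulateTokenBucket` is a hypothetical helper that takes call times in seconds (ascending, starting at or after 0) and assumes the bucket starts full.

```typescript
// Returns, for each call time, whether a token bucket would admit the call.
function simulateTokenBucket(
  callTimesSec: number[],
  capacity: number,
  refillPerSec: number,
): boolean[] {
  let tokens = capacity // bucket starts full
  let last = 0
  return callTimesSec.map(t => {
    // Refill for the time elapsed since the previous call, capped at capacity
    tokens = Math.min(capacity, tokens + (t - last) * refillPerSec)
    last = t
    if (tokens < 1) return false
    tokens -= 1
    return true
  })
}

// Legitimate pattern: 12 calls at once at task start (10-token burst, 1 token/sec)
const burstResult = simulateTokenBucket(Array.from({ length: 12 }, () => 0), 10, 1)
const burstAdmitted = burstResult.filter(Boolean).length

// Runaway pattern: 4 calls per second, sustained for 20 seconds (80 calls)
const runawayTimes = Array.from({ length: 80 }, (_, i) => i * 0.25)
const runawayAdmitted = simulateTokenBucket(runawayTimes, 10, 1).filter(Boolean).length
```

The burst is admitted up to the bucket capacity, while the sustained loop is held close to the initial capacity plus one call per elapsed second, regardless of how fast it issues requests.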
How rate limiting maps to SkillAudit sub-scores
Absent rate limiting affects SkillAudit's Security and Reliability sub-scores because runaway tool call patterns are both an abuse vector and an operational failure, and undocumented limits affect the Documentation sub-score:
- Security sub-score: Missing rate limiting on tools that make outbound API calls generates a Security finding — an attacker who can control LLM behavior (via prompt injection) can use unlimited tool calls to exfiltrate data, exhaust upstream APIs, or trigger cost-based denial of service against the MCP server operator.
- Reliability sub-score: No per-session quotas, no time budgets, and no burst limiting are flagged as reliability issues. An MCP server without these controls is vulnerable to accidental runaway loops — not just malicious ones.
- Documentation sub-score: Rate limits and quotas should be documented in the MCP server's tool descriptions so the LLM can make informed decisions about tool usage frequency. Undocumented rate limits generate Documentation findings.
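One way to keep tool descriptions and enforced limits in sync is to generate the limit text from the same constants the server enforces. The sketch below is an assumption about how that could look; `ToolLimits` and `describeWithLimits` are hypothetical names, and the wording of the generated sentence is only an example.

```typescript
// Hypothetical shape for the limits a tool is subject to.
interface ToolLimits {
  callsPerMinute: number
  callsPerSession: number
  timeBudgetSeconds?: number
}

// Appends a plain-language summary of the limits to a tool description,
// so the LLM sees them before it decides how often to call the tool.
function describeWithLimits(baseDescription: string, limits: ToolLimits): string {
  const parts = [
    `at most ${limits.callsPerMinute} calls per minute`,
    `${limits.callsPerSession} calls per session`,
  ]
  if (limits.timeBudgetSeconds !== undefined) {
    parts.push(`${limits.timeBudgetSeconds}s of total execution time per session`)
  }
  return `${baseDescription} Limits: ${parts.join(', ')}.`
}

const description = describeWithLimits('Search the indexed codebase.', {
  callsPerMinute: 30,
  callsPerSession: 100,
})
```

The resulting string can be passed wherever the server registers the tool's description, so a change to a limit constant updates the documentation in the same commit.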
For the authorization controls that should sit alongside rate limiting, see MCP server access control. For graceful degradation when upstream rate limits are hit (rather than server-side limits), see MCP server graceful degradation security.
Audit your MCP server's rate limiting posture
SkillAudit flags missing session quotas, absent token bucket rate limiting, and undocumented rate limit behavior in MCP server tool implementations.