MCP server rate limiting security — LLM-driven tool call amplification, per-session limits, cost-based rate limiting, denial of capability attacks
Traditional APIs rate-limit human users. MCP servers face a different problem, LLM-driven tool call amplification: an LLM in an agentic loop can make hundreds of tool calls per minute with no human in the loop. A prompt injection that drives such a loop can amplify a single malicious instruction into a sustained denial-of-capability attack or the rapid exhaustion of a downstream API quota. The five rate limiting patterns covered here are per-session call budgets, cost-weighted throttling, agentic loop detection, graceful degradation, and cross-session organization limits.
1. Per-session call budgets
The token bucket algorithm is well suited to per-session rate limiting in MCP: each session receives a bucket with a configurable capacity, and each tool call drains one token. When the bucket is empty, further calls return a structured error that tells the LLM how long to wait before retrying. The bucket refills at a configurable rate: per minute for most tools, per hour for particularly sensitive or expensive operations. The sample below simplifies the refill to a full reset at the end of each window, which makes it a fixed-window counter rather than a continuously refilling bucket.
The bucket capacity and refill rate should vary by tool sensitivity. A file-read tool that exposes no sensitive data may have a generous budget of 120 calls per minute. A bulk API call tool that fans out to a downstream service may have a much tighter budget of 6 calls per minute to protect the downstream service's quota. The budget policy is declarative: it is defined in a configuration object per tool and enforced by a shared middleware layer, so individual tool handlers do not need to implement their own rate limiting.
interface RateLimitPolicy {
  toolName: string
  costUnits: number
  sessionBudgetPerMinute: number
  sessionBudgetPerHour: number
}

interface TokenBucket {
  minuteTokens: number
  hourTokens: number
  minuteResetAt: number
  hourResetAt: number
}

const RATE_LIMIT_POLICIES: Record<string, RateLimitPolicy> = {
  'read_file':       { toolName: 'read_file',       costUnits: 1,  sessionBudgetPerMinute: 120, sessionBudgetPerHour: 1000 },
  'search_codebase': { toolName: 'search_codebase', costUnits: 5,  sessionBudgetPerMinute: 20,  sessionBudgetPerHour: 200 },
  'bulk_api_call':   { toolName: 'bulk_api_call',   costUnits: 10, sessionBudgetPerMinute: 6,   sessionBudgetPerHour: 60 },
  'create_webhook':  { toolName: 'create_webhook',  costUnits: 20, sessionBudgetPerMinute: 3,   sessionBudgetPerHour: 10 }
}

const sessionBuckets = new Map<string, Map<string, TokenBucket>>()

function getOrCreateBucket(sessionId: string, toolName: string, policy: RateLimitPolicy): TokenBucket {
  if (!sessionBuckets.has(sessionId)) sessionBuckets.set(sessionId, new Map())
  const toolBuckets = sessionBuckets.get(sessionId)!
  if (!toolBuckets.has(toolName)) {
    toolBuckets.set(toolName, {
      minuteTokens: policy.sessionBudgetPerMinute,
      hourTokens: policy.sessionBudgetPerHour,
      minuteResetAt: Date.now() + 60_000,
      hourResetAt: Date.now() + 3_600_000
    })
  }
  return toolBuckets.get(toolName)!
}
function checkRateLimit(sessionId: string, toolName: string): {
  allowed: boolean; retryAfterMs?: number; remainingMinute: number; remainingHour: number
} {
  const policy = RATE_LIMIT_POLICIES[toolName]
  // Tools without a policy are not limited here (fail-open); a conservative default policy may be safer.
  if (!policy) return { allowed: true, remainingMinute: 999, remainingHour: 9999 }
  const bucket = getOrCreateBucket(sessionId, toolName, policy)
  const now = Date.now()
  // Fixed-window refill: restore the full budget once the window has elapsed.
  if (now >= bucket.minuteResetAt) { bucket.minuteTokens = policy.sessionBudgetPerMinute; bucket.minuteResetAt = now + 60_000 }
  if (now >= bucket.hourResetAt) { bucket.hourTokens = policy.sessionBudgetPerHour; bucket.hourResetAt = now + 3_600_000 }
  // Each call drains exactly one token, so the budgets are call counts.
  // costUnits is not charged here: a cost of 10 against a budget of 6 would block bulk_api_call
  // permanently. Cost units feed cost-weighted throttling (pattern 2).
  if (bucket.minuteTokens < 1) {
    return { allowed: false, retryAfterMs: bucket.minuteResetAt - now, remainingMinute: bucket.minuteTokens, remainingHour: bucket.hourTokens }
  }
  if (bucket.hourTokens < 1) {
    return { allowed: false, retryAfterMs: bucket.hourResetAt - now, remainingMinute: bucket.minuteTokens, remainingHour: bucket.hourTokens }
  }
  bucket.minuteTokens -= 1
  bucket.hourTokens -= 1
  return { allowed: true, remainingMinute: bucket.minuteTokens, remainingHour: bucket.hourTokens }
}
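The sample above restores the whole budget at the end of each window, so a session can spend two full budgets back to back across a window boundary. A bucket that refills continuously avoids that burst. The following is a minimal sketch of that variant, not the limiter used above: the names `RefillingBucket`, `createBucket`, and `tryConsume` are hypothetical, and the clock is passed in as `now` so the logic can be exercised without real time passing.

```typescript
// Hypothetical continuous-refill token bucket (illustrative names, not from the samples above).
interface RefillingBucket {
  capacity: number      // maximum number of tokens the bucket can hold
  refillPerMs: number   // tokens added per millisecond
  tokens: number        // current token count (may be fractional)
  lastRefillAt: number  // timestamp of the last refill, in milliseconds
}

function createBucket(capacity: number, refillPerMinute: number, now: number): RefillingBucket {
  return { capacity, refillPerMs: refillPerMinute / 60_000, tokens: capacity, lastRefillAt: now }
}

// Tries to take one token for one tool call.
function tryConsume(bucket: RefillingBucket, now: number): { allowed: boolean; retryAfterMs: number } {
  // Add the tokens earned since the last call, never exceeding capacity.
  const elapsed = Math.max(0, now - bucket.lastRefillAt)
  bucket.tokens = Math.min(bucket.capacity, bucket.tokens + elapsed * bucket.refillPerMs)
  bucket.lastRefillAt = now

  if (bucket.tokens >= 1) {
    bucket.tokens -= 1
    return { allowed: true, retryAfterMs: 0 }
  }
  // Time until the bucket holds one whole token again.
  const retryAfterMs = Math.ceil((1 - bucket.tokens) / bucket.refillPerMs)
  return { allowed: false, retryAfterMs }
}
```

With a capacity of 2 and a refill rate of 6 per minute, a session can burst two calls and is then held to roughly one call every ten seconds, instead of receiving the whole budget back at once.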
2. Cost-weighted throttling
Limiting on call count treats a read_file call identically to a bulk_api_call that fans out to 50 downstream requests. An LLM that makes 10 bulk_api_call calls in a burst window generates 500 downstream requests while consuming only 10 rate limit tokens. Cost-weighted throttling assigns each tool a cost unit value that reflects its actual impact, then limits on cost units rather than call count.
// Cost unit definitions: reflect actual downstream API impact
const COST_UNITS: Record<string, { value: number; label: string }> = {
  'get_file_content':   { value: 1,  label: 'file_read' },
  'get_commit_message': { value: 1,  label: 'metadata_read' },
  'search_codebase':    { value: 5,  label: 'search_fanout' },
  'find_references':    { value: 8,  label: 'index_scan' },
  'create_issue':       { value: 15, label: 'write_op' },
  'push_commit':        { value: 25, label: 'write_op_critical' },
  'bulk_label_issues':  { value: 40, label: 'bulk_write' },
  'export_repository':  { value: 50, label: 'bulk_read_large' }
}

class CostWeightedThrottle {
  private readonly SESSION_BUDGET_PER_MIN = 100
  private sessionUsage = new Map<string, { units: number; resetAt: number }>()

  check(sessionId: string, toolName: string): {
    allowed: boolean; costUnits: number; remainingBudget: number; retryAfterSeconds?: number
  } {
    const cost = COST_UNITS[toolName] ?? { value: 1, label: 'unknown' }
    const now = Date.now()
    let usage = this.sessionUsage.get(sessionId)
    if (!usage || now > usage.resetAt) {
      usage = { units: 0, resetAt: now + 60_000 }
      this.sessionUsage.set(sessionId, usage)
    }
    const remainingBudget = this.SESSION_BUDGET_PER_MIN - usage.units
    if (cost.value > remainingBudget) {
      return {
        allowed: false,
        costUnits: cost.value,
        remainingBudget,
        retryAfterSeconds: Math.ceil((usage.resetAt - now) / 1000)
      }
    }
    usage.units += cost.value
    return { allowed: true, costUnits: cost.value, remainingBudget: remainingBudget - cost.value }
  }
}
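A static cost table prices every bulk_label_issues call the same, whether it touches 2 issues or 200. One possible refinement is to derive the cost from the request arguments. The sketch below assumes a hypothetical cost model (a base cost plus a per-item cost) and a hypothetical `issueIds` argument; neither comes from the samples above, and the numbers are placeholders.

```typescript
// Hypothetical argument-aware cost model: base cost plus a per-item cost for bulk tools.
interface CostRule {
  base: number
  perItem?: number     // extra cost per element of the bulk argument
  itemsField?: string  // name of the array argument that drives the fan-out
}

const DYNAMIC_COSTS = new Map<string, CostRule>([
  ['get_file_content', { base: 1 }],
  ['bulk_label_issues', { base: 5, perItem: 1, itemsField: 'issueIds' }], // 'issueIds' is an assumed argument name
])

// Assumption: tools missing from the table are priced conservatively rather than cheaply.
const UNKNOWN_TOOL_COST = 10

function computeCost(toolName: string, args: Record<string, unknown>): number {
  const rule = DYNAMIC_COSTS.get(toolName)
  if (!rule) return UNKNOWN_TOOL_COST

  let cost = rule.base
  if (rule.perItem !== undefined && rule.itemsField !== undefined) {
    const items = args[rule.itemsField]
    // Only charge per item when the argument really is an array.
    if (Array.isArray(items)) cost += rule.perItem * items.length
  }
  return cost
}
```

The result of `computeCost` would replace the static `COST_UNITS` lookup inside a throttle such as `CostWeightedThrottle`, so that a 40-issue bulk call is charged 45 units while a 2-issue call is charged 7.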
3. Agentic loop detection
A prompt injection that drives an LLM into a tool-call loop is qualitatively different from a high-volume legitimate session. The distinguishing pattern is repeated identical calls: the same tool, with the same arguments, in rapid succession. A legitimate LLM session may call read_file many times, but with different paths as it explores a codebase. A loop driven by prompt injection tends to call it with the same path repeatedly, because each tool result delivers the injected instruction again.
Loop detection fires a circuit breaker when the same tool-argument combination is observed N times within T seconds. The circuit breaker returns a structured error, logs the pattern for security review, and holds the session in a cooldown period before allowing further calls.
import { createHash } from 'crypto'

// auditLog is assumed to be the application's structured audit logger (not defined here).

class AgenticLoopDetector {
  private readonly LOOP_WINDOW_MS = 10_000 // 10 seconds
  private readonly LOOP_THRESHOLD = 4      // 4 identical calls = loop
  private readonly COOLDOWN_MS = 60_000    // 60 second cooldown
  private callHistory = new Map<string, { toolName: string; timestamps: number[] }>()
  private cooldowns = new Map<string, number>()

  // Short fingerprint of the arguments. JSON.stringify is sensitive to key order,
  // so the same arguments with reordered keys produce a different hash.
  private hashArgs(args: unknown): string {
    return createHash('sha256').update(JSON.stringify(args) ?? '').digest('hex').slice(0, 16)
  }

  check(sessionId: string, toolName: string, args: unknown): {
    isLoop: boolean; loopPattern?: { tool: string; callCount: number }; inCooldown: boolean; cooldownRemainingMs?: number
  } {
    const now = Date.now()
    // While a session is cooling down, every call is rejected regardless of tool or arguments.
    const cooldownExpiry = this.cooldowns.get(sessionId)
    if (cooldownExpiry && now < cooldownExpiry) {
      return { isLoop: true, inCooldown: true, cooldownRemainingMs: cooldownExpiry - now }
    }
    const argsHash = this.hashArgs(args)
    const key = `${sessionId}:${toolName}:${argsHash}`
    let signal = this.callHistory.get(key)
    if (!signal) { signal = { toolName, timestamps: [] }; this.callHistory.set(key, signal) }
    // Sliding window: keep only the calls made within the last LOOP_WINDOW_MS.
    signal.timestamps = signal.timestamps.filter(t => now - t < this.LOOP_WINDOW_MS)
    signal.timestamps.push(now)
    if (signal.timestamps.length >= this.LOOP_THRESHOLD) {
      this.cooldowns.set(sessionId, now + this.COOLDOWN_MS)
      auditLog.warn('agentic_loop_detected', {
        sessionId, toolName, argsHash,
        callCount: signal.timestamps.length,
        severity: 'HIGH',
        note: 'May indicate prompt injection driving a tool-call loop'
      })
      return {
        isLoop: true,
        loopPattern: { tool: toolName, callCount: signal.timestamps.length },
        inCooldown: true,
        cooldownRemainingMs: this.COOLDOWN_MS
      }
    }
    return { isLoop: false, inCooldown: false }
  }
}
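Because the detector above fingerprints arguments with JSON.stringify, two calls whose arguments differ only in key order produce different keys and are counted separately. One way to close that gap is to serialize the arguments canonically before hashing. The sketch below is an illustration under assumptions: `canonicalize` and `loopKey` are hypothetical helpers, and the input is assumed to be plain JSON-like data (no bigint values or cyclic references).

```typescript
// Hypothetical helper: serialize a JSON-like value with object keys sorted,
// so that key order does not change the result.
function canonicalize(value: unknown): string {
  if (value === null || typeof value !== 'object') {
    // Primitives; undefined has no JSON form and is written as null.
    return JSON.stringify(value) ?? 'null'
  }
  if (Array.isArray(value)) {
    // Array order is meaningful, so elements keep their positions.
    return '[' + value.map(item => canonicalize(item)).join(',') + ']'
  }
  const obj = value as Record<string, unknown>
  const keys = Object.keys(obj).filter(k => obj[k] !== undefined).sort()
  return '{' + keys.map(k => JSON.stringify(k) + ':' + canonicalize(obj[k])).join(',') + '}'
}

// Hypothetical loop-detection key built from the canonical form.
function loopKey(sessionId: string, toolName: string, args: unknown): string {
  return `${sessionId}:${toolName}:${canonicalize(args)}`
}
```

In the detector, the canonical string would be fed to the hash in place of the raw JSON.stringify output. Canonicalization only removes accidental variation; arguments that differ in content, such as a changing counter, still produce distinct keys.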
4. Graceful degradation
Standard HTTP rate limiting returns a 429 status code. LLM frameworks that receive a 429 typically implement exponential backoff at the HTTP level. In an agentic tool-call loop, however, the framework may retry the tool call immediately as part of the LLM reasoning step rather than honoring the 429, which produces a rapid retry flood instead of a well-behaved backoff.
MCP rate limiting should therefore operate at the tool result layer, not the HTTP layer. When a rate limit is hit, the server returns a successful HTTP 200 with a tool result whose content signals the rate limit condition in structured form. An LLM that receives a tool result with retry guidance is more likely to honor it than one that receives an exception.
// Return structured tool result, not HTTP 429
function buildRateLimitedToolResult(params: {
  toolName: string; retryAfterSeconds: number; remainingBudget: number; alternativeTools?: string[]
}): object {
  const { toolName, retryAfterSeconds, remainingBudget, alternativeTools = [] } = params
  return {
    content: [
      {
        type: 'text',
        text: `Rate limited: ${toolName} is temporarily unavailable. Retry after ${retryAfterSeconds}s.`
      },
      {
        type: 'resource',
        resource: {
          uri: 'mcp://rate-limit-info',
          mimeType: 'application/json',
          text: JSON.stringify({
            status: 'rate_limited',
            limited_tool: toolName,
            retry_after_seconds: retryAfterSeconds,
            remaining_budget_units: remainingBudget,
            available_tools: alternativeTools,
            guidance: alternativeTools.length > 0
              ? `While waiting, you may use: ${alternativeTools.join(', ')}`
              : 'Please pause and retry after the wait period.'
          })
        }
      }
    ],
    // isError: false because this is a handled condition, not an unhandled exception.
    // An isError: true result may cause the LLM framework to retry immediately.
    isError: false
  }
}
// Middleware: wrap tool handlers with rate limiting.
// The two types below are assumed shapes, inferred from how they are used here.
// CallToolRequest is taken to be the tool-call request type (for example, the one
// exported by the MCP TypeScript SDK).
type RequestContext = { sessionId: string }
type ToolHandler = (request: CallToolRequest, context: RequestContext) => Promise<object>

function withRateLimit(
  handler: ToolHandler,
  throttle: CostWeightedThrottle,
  loopDetector: AgenticLoopDetector
) {
  return async (request: CallToolRequest, context: RequestContext): Promise<object> => {
    // Loop detection runs first so that a looping session does not also burn its cost budget.
    const loopCheck = loopDetector.check(context.sessionId, request.params.name, request.params.arguments)
    if (loopCheck.isLoop) {
      return buildRateLimitedToolResult({
        toolName: request.params.name,
        retryAfterSeconds: Math.ceil((loopCheck.cooldownRemainingMs ?? 60_000) / 1000),
        remainingBudget: 0
      })
    }
    const rateLimitCheck = throttle.check(context.sessionId, request.params.name)
    if (!rateLimitCheck.allowed) {
      return buildRateLimitedToolResult({
        toolName: request.params.name,
        retryAfterSeconds: rateLimitCheck.retryAfterSeconds ?? 60,
        remainingBudget: rateLimitCheck.remainingBudget
      })
    }
    const result = await handler(request, context)
    // Augment successful result with remaining budget for LLM self-regulation
    return Object.assign({}, result, {
      _meta: { rate_limit: { remaining_budget_units: rateLimitCheck.remainingBudget } }
    })
  }
}
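The middleware above never fills in `alternativeTools`, so the LLM is only ever told to wait. A small lookup can suggest cheaper tools that still fit the remaining budget. The sketch below is illustrative: `TOOL_COSTS` repeats a few values from the cost table in section 2, while `FALLBACKS` and `suggestAlternatives` are hypothetical and would need to reflect which tools can really stand in for one another.

```typescript
// Cost units for a few tools (values copied from the cost table in section 2).
const TOOL_COSTS = new Map<string, number>([
  ['get_file_content', 1],
  ['search_codebase', 5],
  ['find_references', 8],
  ['export_repository', 50],
])

// Hypothetical fallback map: cheaper tools that can make partial progress on the same task.
const FALLBACKS = new Map<string, string[]>([
  ['export_repository', ['search_codebase', 'get_file_content']],
  ['find_references', ['search_codebase']],
])

// Returns the fallbacks for a limited tool that the session can still afford.
function suggestAlternatives(limitedTool: string, remainingBudget: number): string[] {
  const candidates = FALLBACKS.get(limitedTool) ?? []
  return candidates.filter(name => {
    const cost = TOOL_COSTS.get(name)
    // Skip tools with no known cost rather than guessing.
    return cost !== undefined && cost <= remainingBudget
  })
}
```

The returned list would be passed as `alternativeTools` to `buildRateLimitedToolResult`. Suggesting only affordable tools matters: pointing the LLM at a tool that will also be rejected just moves the retry flood to another tool.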
5. Cross-session organization limits
Per-session limits prevent a single session from consuming excessive resources. But without a cross-session layer, an attacker who controls many sessions can distribute tool calls across them to stay under per-session limits while consuming all available server capacity. Cross-session organization limits impose an aggregate cap across all sessions belonging to a single user or organization. The org-level limiter runs in addition to per-session limits — both must pass for a tool call to proceed.
interface OrgUsageRecord {
  orgId: string
  unitsThisMinute: number
  unitsThisHour: number
  activeSessionCount: number
  minuteResetAt: number
  hourResetAt: number
}

// auditLog is assumed to be the application's structured audit logger (not defined here).

class OrganizationRateLimiter {
  private readonly ORG_BUDGET_PER_MINUTE = 500
  private readonly ORG_BUDGET_PER_HOUR = 5000
  private readonly ORG_MAX_SESSIONS = 20
  private orgUsage = new Map<string, OrgUsageRecord>()
  private sessionOrgMap = new Map<string, string>()

  registerSession(sessionId: string, orgId: string): void {
    // Registering the same session twice must not inflate the session count.
    if (this.sessionOrgMap.has(sessionId)) return
    this.sessionOrgMap.set(sessionId, orgId)
    this.getOrCreateRecord(orgId).activeSessionCount++
  }

  unregisterSession(sessionId: string): void {
    const orgId = this.sessionOrgMap.get(sessionId)
    if (orgId) {
      const record = this.orgUsage.get(orgId)
      if (record) record.activeSessionCount = Math.max(0, record.activeSessionCount - 1)
      this.sessionOrgMap.delete(sessionId)
    }
  }

  check(sessionId: string, costUnits: number): {
    allowed: boolean; reason?: 'org_minute_budget' | 'org_hour_budget' | 'org_session_limit'; retryAfterSeconds?: number
  } {
    const orgId = this.sessionOrgMap.get(sessionId)
    // Unregistered sessions bypass org limits (fail-open), so every session should be registered on connect.
    if (!orgId) return { allowed: true }
    const now = Date.now()
    const record = this.getOrCreateRecord(orgId)
    if (now > record.minuteResetAt) { record.unitsThisMinute = 0; record.minuteResetAt = now + 60_000 }
    if (now > record.hourResetAt) { record.unitsThisHour = 0; record.hourResetAt = now + 3_600_000 }
    // While the org is over its session cap, calls from all of its sessions are rejected.
    if (record.activeSessionCount > this.ORG_MAX_SESSIONS) return { allowed: false, reason: 'org_session_limit' }
    if (record.unitsThisMinute + costUnits > this.ORG_BUDGET_PER_MINUTE) {
      return { allowed: false, reason: 'org_minute_budget', retryAfterSeconds: Math.ceil((record.minuteResetAt - now) / 1000) }
    }
    if (record.unitsThisHour + costUnits > this.ORG_BUDGET_PER_HOUR) {
      return { allowed: false, reason: 'org_hour_budget', retryAfterSeconds: Math.ceil((record.hourResetAt - now) / 1000) }
    }
    record.unitsThisMinute += costUnits
    record.unitsThisHour += costUnits
    return { allowed: true }
  }

  // Emergency kill switch for an organization: overfills both windows and
  // pushes their reset times 24 hours out, so check() rejects every call until then.
  blockOrg(orgId: string, reason: string): void {
    const record = this.getOrCreateRecord(orgId)
    record.unitsThisMinute = this.ORG_BUDGET_PER_MINUTE * 10
    record.minuteResetAt = Date.now() + 24 * 60 * 60 * 1000
    record.unitsThisHour = this.ORG_BUDGET_PER_HOUR * 10
    record.hourResetAt = Date.now() + 24 * 60 * 60 * 1000
    auditLog.warn('org_blocked', { orgId, reason, durationHours: 24 })
  }

  private getOrCreateRecord(orgId: string): OrgUsageRecord {
    if (!this.orgUsage.has(orgId)) {
      const now = Date.now()
      this.orgUsage.set(orgId, { orgId, unitsThisMinute: 0, unitsThisHour: 0, activeSessionCount: 0, minuteResetAt: now + 60_000, hourResetAt: now + 3_600_000 })
    }
    return this.orgUsage.get(orgId)!
  }
}
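The requirement that both the session limit and the org limit pass raises an ordering question: if one limiter charges its budget before the other rejects, the session pays for a call that never ran. One way to handle this is to separate the affordability check from the charge. The sketch below is an assumption-laden illustration, not the limiter classes above: `WindowBudget` and `checkBoth` are hypothetical names, and the clock is passed in as `now` for testability.

```typescript
// Hypothetical fixed-window budget with separate "can afford" and "charge" steps.
class WindowBudget {
  private used = 0
  private resetAt: number
  private readonly limit: number
  private readonly windowMs: number

  constructor(limit: number, windowMs: number, now: number) {
    this.limit = limit
    this.windowMs = windowMs
    this.resetAt = now + windowMs
  }

  // Start a fresh window once the current one has elapsed.
  private roll(now: number): void {
    if (now >= this.resetAt) { this.used = 0; this.resetAt = now + this.windowMs }
  }

  canAfford(cost: number, now: number): boolean {
    this.roll(now)
    return this.used + cost <= this.limit
  }

  charge(cost: number, now: number): void {
    this.roll(now)
    this.used += cost
  }

  remaining(now: number): number {
    this.roll(now)
    return this.limit - this.used
  }
}

type Decision = { allowed: true } | { allowed: false; reason: 'session_budget' | 'org_budget' }

// Both budgets must be able to afford the call; neither is charged unless both can.
function checkBoth(session: WindowBudget, org: WindowBudget, cost: number, now: number): Decision {
  if (!session.canAfford(cost, now)) return { allowed: false, reason: 'session_budget' }
  if (!org.canAfford(cost, now)) return { allowed: false, reason: 'org_budget' }
  session.charge(cost, now)
  org.charge(cost, now)
  return { allowed: true }
}
```

This sketch is safe only while the check and the charge run without interleaving, which holds for synchronous code in a single Node.js process. A multi-process deployment would need a shared store with an atomic check-and-increment.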
SkillAudit checks for rate limiting security
- No per-session rate limiting: Tool handlers that make downstream API calls without a session-level rate limit — an LLM-driven agentic loop can exhaust downstream API quotas in seconds
- Rate limiting at HTTP layer only: Server returns HTTP 429 rather than a structured tool result — LLM frameworks may not honor HTTP-level rate limits in agentic contexts, leading to immediate retry floods
- No loop detection: No mechanism to detect and break repeated identical tool calls in rapid succession — prompt injection loops will run unbounded
- Uniform cost per call: Rate limiter treats a bulk operation identically to a single-resource read — cost-weighted throttling is required to account for actual downstream impact
- No organization-level limits: Per-session limits only — no aggregate cap across sessions from the same organization or user
SkillAudit scans for these patterns automatically.