Topic: token budget security

MCP server token budget security — context window exhaustion, response padding, tool loop inflation

Every LLM session has a finite context window. Tool responses from MCP servers are inserted verbatim into that window. A tool that returns a 50,000-word document, a full repository listing, or a raw database dump does not just cost money: it can displace earlier context, including the system prompt, safety instructions, and conversation history that the model needs to behave correctly. Adversarial documents exploit this deliberately. They carry padding that makes the tool response large enough to push system instructions out of the context, which effectively removes the model's safety constraints. This page covers five token budget security patterns: response size limits with truncation notices, content summarization before return, per-session token budget accounting, recursive tool call depth limits, and adversarial padding pattern detection.

Quick reference

Hard response size cap: Truncate tool responses at a configurable character/token limit before returning to the LLM. Always include a truncation notice: "Result truncated at 10,000 characters. Use pagination to retrieve more."
Return structured data, not prose: A tool that returns a list of 200 issues should return an array of {id, title, state} objects, not the full issue body. Prose is for humans; the LLM needs structured data to reason about, not the full text.
Token accounting: Estimate tokens consumed per tool call (characters / 4 as a rough approximation). Track cumulative tokens per session. Warn as the session approaches the context limit; stop accepting tool calls when within 10% of the limit (90% consumed).
Depth limits: Block tool calls initiated by a tool response (recursive tool use). Limit tool call depth to prevent runaway agent loops from exhausting the budget in seconds.
Padding detection: Scan tool response text for repetitive content (high character-to-unique-character ratios), unusual Unicode repetition, or zero-width characters — signals of adversarial context inflation.

1. Response size limits with truncation notices

The simplest and most effective token budget protection is a hard limit on the size of every tool response. The limit must be set at the tool handler level, not at the MCP transport level — the transport carries raw bytes, not semantic content, so it cannot distinguish between a useful 10,000-character response and an adversarial 100,000-character padding attack. The handler knows the tool's semantic purpose and can set a contextually appropriate limit.

Equally important: the truncation notice. A tool that silently truncates its response leaves the LLM with partial data and no indication that the response was cut. The LLM may interpret the truncated response as complete, leading to incorrect reasoning. A truncation notice tells the LLM the result was limited and provides a mechanism to retrieve more: "Showing 1-50 of 847 items. Call list_issues with page=2 to continue."

const RESPONSE_CHAR_LIMIT = 10_000 // ~2,500 tokens at 4 chars/token
const RESPONSE_ITEM_LIMIT = 50     // max items in array responses

// DANGEROUS: returning full content without size limit
async function getDocumentDangerous(docId: string): Promise<object> {
  const doc = await db.query('SELECT * FROM documents WHERE id = $1', [docId])
  return doc.rows[0] // may return a 500,000-character document body
}

// SAFE: truncate with notice
async function getDocumentSafe(docId: string): Promise<object> {
  const doc = await db.query('SELECT id, title, created_at, author, body FROM documents WHERE id = $1', [docId])
  const row = doc.rows[0]
  if (!row) return { error: 'Document not found' }

  const body: string = row.body ?? ''
  const truncated = body.length > RESPONSE_CHAR_LIMIT
  return {
    id: row.id,
    title: row.title,
    created_at: row.created_at,
    author: row.author,
    body: truncated ? body.slice(0, RESPONSE_CHAR_LIMIT) : body,
    ...(truncated && {
      truncated: true,
      truncation_notice: `Document body truncated at ${RESPONSE_CHAR_LIMIT} characters (full length: ${body.length}). Call get_document_section to retrieve specific sections.`,
      full_length: body.length,
    }),
  }
}

// For list responses: paginate + limit
async function listIssuesSafe(
  repo: string,
  page = 1,
): Promise<object> {
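  page = Math.max(1, Math.floor(page) || 1) // clamp: page <= 0, NaN, or a fraction would produce an invalid OFFSET
  const offset = (page - 1) * RESPONSE_ITEM_LIMIT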
  const result = await db.query(
    'SELECT id, title, state, created_at FROM issues WHERE repo = $1 ORDER BY created_at DESC LIMIT $2 OFFSET $3',
    [repo, RESPONSE_ITEM_LIMIT + 1, offset], // fetch one extra to detect if there is a next page
  )
  const hasMore = result.rows.length > RESPONSE_ITEM_LIMIT
  const items = result.rows.slice(0, RESPONSE_ITEM_LIMIT)
  return {
    items,
    page,
    has_more: hasMore,
    ...(hasMore && { next_page_notice: `Showing page ${page} of results. Call list_issues with page=${page + 1} for more.` }),
  }
}
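
The same truncate-and-notify rule can be factored into one helper that any handler calls on its text fields. The following is a minimal sketch, not part of the original listing: `truncateWithNotice`, the `TruncatedText` shape, and the default hint text are hypothetical names chosen for illustration. It also backs off by one code unit when the cut would split a UTF-16 surrogate pair, so the truncated string stays valid.

```typescript
// Hypothetical helper: names, shape, and the 10,000-character default are illustrative.
interface TruncatedText {
  text: string
  truncated: boolean
  fullLength: number
  notice?: string
}

function truncateWithNotice(
  input: string,
  limit = 10_000,
  hint = 'Use pagination to retrieve more.',
): TruncatedText {
  const cap = Math.max(0, Math.floor(limit)) // guard against negative or fractional limits
  if (input.length <= cap) {
    return { text: input, truncated: false, fullLength: input.length }
  }

  // If the last kept code unit is a high surrogate, drop it so the cut
  // does not leave half of a surrogate pair (for example, half an emoji).
  let end = cap
  const last = input.charCodeAt(end - 1)
  if (last >= 0xd800 && last <= 0xdbff) end -= 1

  return {
    text: input.slice(0, end),
    truncated: true,
    fullLength: input.length,
    notice: `Result truncated at ${end} characters (full length: ${input.length}). ${hint}`,
  }
}
```

A handler would spread the result into its response, for example `{ id, title, ...truncateWithNotice(body) }`, so the notice is only present when something was actually cut.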

2. Content summarization before return

For tools that inherently return large content — document retrieval, code file reading, web page fetching — the alternative to truncation is summarization: the MCP server pre-processes the content before returning it to the LLM, reducing token consumption while preserving the information the LLM needs to complete its task. This is particularly valuable for document bodies and web page content, where the full text is rarely needed — the LLM typically needs to answer a question about the document, not have the entire text in context.

// For code files: return structure summary instead of full content
async function readCodeFileSafe(path: string, task: string): Promise<object> {
  // readFileSafe is the safe FD-closing reader, assumed to be defined elsewhere
  const content = await readFileSafe(path)

  // Level of detail depends on file size; `task` is accepted but not yet used
  // to choose what to return
  if (content.length < 2000) {
    // Small files: return full content
    return { path, content, size: content.length }
  }

  // Large files: return structure summary + the first section
  const lines = content.split('\n')
  const exportedNames = lines
    .filter(l => l.match(/^export\s+(function|class|const|async function|interface|type)\s+(\w+)/))
    .map(l => l.match(/^export\s+\w+\s+(?:function\s+)?(\w+)/)?.[1])
    .filter(Boolean)

  const topDeclarations = lines
    .filter(l => l.match(/^(export\s+)?(async\s+)?function\s+\w+|^(export\s+)?class\s+\w+/))
    .slice(0, 10) // first 10 function/class declaration lines (signatures, not bodies)

  return {
    path,
    size: content.length,
    line_count: lines.length,
    exported_symbols: exportedNames,
    top_declarations: topDeclarations,
    preview: lines.slice(0, 50).join('\n'), // first 50 lines
    summary: `File contains ${lines.length} lines with ${exportedNames.length} exported symbols. Use get_function_body to retrieve specific function implementations.`,
  }
}

// For web pages: strip HTML, extract key content only
function extractPageContent(html: string, maxChars = 5000): string {
  // Remove scripts, styles, and HTML tags
  const textContent = html
    .replace(/<script[\s\S]*?<\/script>/gi, '')
    .replace(/<style[\s\S]*?<\/style>/gi, '')
    .replace(/<[^>]+>/g, ' ')
    .replace(/\s{2,}/g, ' ')
    .trim()

  if (textContent.length <= maxChars) return textContent

  // Return start + end — context about page structure and CTA
  const start = textContent.slice(0, maxChars * 0.7)
  const end = textContent.slice(-maxChars * 0.3)
  return `${start}\n\n[... ${textContent.length - maxChars} characters omitted ...]\n\n${end}`
}
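
When the caller's question is known, another way to shrink a large document is to return only the paragraphs that mention the question's terms. The sketch below is an illustration under assumptions, not a method described above: `extractRelevantParagraphs` is a hypothetical helper, scoring is plain keyword overlap, and the `\W` word split only handles ASCII word characters, so non-English text would need a different tokenizer.

```typescript
// Hypothetical helper: keyword-overlap ranking of paragraphs under a character budget.
function extractRelevantParagraphs(text: string, question: string, maxChars = 5000): string {
  // Question terms: lowercase words longer than 3 characters (skips most filler words).
  const terms = new Set(
    question.toLowerCase().split(/\W+/).filter(w => w.length > 3),
  )
  const paragraphs = text
    .split(/\n\s*\n/)
    .map(p => p.trim())
    .filter(p => p.length > 0)

  // Score each paragraph by how many of its words are question terms.
  const ranked = paragraphs
    .map((paragraph, index) => {
      const score = paragraph.toLowerCase().split(/\W+/).filter(w => terms.has(w)).length
      return { paragraph, index, score }
    })
    .filter(s => s.score > 0)
    .sort((a, b) => b.score - a.score || a.index - b.index)

  // Greedily keep the best paragraphs that still fit in the budget.
  const picked: typeof ranked = []
  let used = 0
  for (const item of ranked) {
    const cost = item.paragraph.length + 2 // +2 for the blank-line separator
    if (used + cost > maxChars) continue
    picked.push(item)
    used += cost
  }

  // Restore document order and tell the LLM how much was left out.
  picked.sort((a, b) => a.index - b.index)
  const parts = picked.map(p => p.paragraph)
  const omitted = paragraphs.length - picked.length
  if (omitted > 0) parts.push(`[${omitted} of ${paragraphs.length} paragraphs omitted]`)
  return parts.join('\n\n')
}
```

The omission line plays the same role as a truncation notice: the LLM can see that the response is a selection, not the full document.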

3. Per-session token budget accounting

Individual response size limits protect against a single large response. Per-session token accounting protects against cumulative exhaustion: many small responses that collectively fill the context window. A session that has consumed 80% of the context window through legitimate tool calls is also a session where subsequent tool responses displace system prompt and early conversation context — the LLM's anchoring instructions are gone, replaced by recent tool output.

const CONTEXT_WINDOW_TOKENS = 200_000 // Claude's context window
const WARNING_THRESHOLD = 0.75         // warn at 75% consumed
const HARD_LIMIT_THRESHOLD = 0.90      // stop at 90% — preserve 20k tokens for reasoning

// Rough token estimate: 1 token ≈ 4 characters (English prose)
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4)
}

interface SessionBudget {
  sessionId: string
  tokensConsumed: number
  toolCallCount: number
}

const sessionBudgets = new Map<string, SessionBudget>()

function getOrCreateBudget(sessionId: string): SessionBudget {
  if (!sessionBudgets.has(sessionId)) {
    sessionBudgets.set(sessionId, { sessionId, tokensConsumed: 0, toolCallCount: 0 })
  }
  return sessionBudgets.get(sessionId)!
}

function trackAndCheckBudget(sessionId: string, responseText: string): void {
  const budget = getOrCreateBudget(sessionId)
  const responseTokens = estimateTokens(responseText)
  budget.tokensConsumed += responseTokens
  budget.toolCallCount++

  const utilizationRatio = budget.tokensConsumed / CONTEXT_WINDOW_TOKENS

  if (utilizationRatio >= HARD_LIMIT_THRESHOLD) {
    throw new Error(
      `Session token budget exhausted (${budget.tokensConsumed.toLocaleString()} of ${CONTEXT_WINDOW_TOKENS.toLocaleString()} estimated tokens used). ` +
      `Start a new session to continue. This session has made ${budget.toolCallCount} tool calls.`
    )
  }

  if (utilizationRatio >= WARNING_THRESHOLD) {
    // Log a budget warning for operators. This function returns void, so the
    // warning does not reach the LLM unless the caller also adds it to the response.
    console.warn({ event: 'token_budget_warning', sessionId, utilized: utilizationRatio.toFixed(2) })
  }
}

// Wrapper that applies the budget check to any tool handler
function withBudgetTracking<T extends object>(
  sessionId: string,
  handler: () => Promise<T>,
): Promise<T> {
  return handler().then(result => {
    trackAndCheckBudget(sessionId, JSON.stringify(result))
    return result
  })
}
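
A log entry alone is invisible to the model. If the LLM should be able to react to a nearly full budget, the warning has to travel inside the tool response. The following is a minimal sketch of that idea: `recordAndAnnotate`, the `budget_warning` field, and the constants are hypothetical and deliberately use their own names so the block stands alone.

```typescript
// Illustrative values; tune them for the model and client you target.
const WINDOW_TOKENS = 200_000
const WARN_AT = 0.75

const consumedBySession = new Map<string, number>()

function approxTokens(text: string): number {
  return Math.ceil(text.length / 4) // rough heuristic, not a real tokenizer
}

// Records the estimated cost of a result and, past the warning threshold,
// returns a copy of the result with a budget_warning field attached.
function recordAndAnnotate<T extends object>(
  sessionId: string,
  result: T,
): T | (T & { budget_warning: string }) {
  const used = (consumedBySession.get(sessionId) ?? 0) + approxTokens(JSON.stringify(result))
  consumedBySession.set(sessionId, used)

  const ratio = used / WINDOW_TOKENS
  if (ratio < WARN_AT) return result

  return {
    ...result,
    budget_warning:
      `About ${Math.round(ratio * 100)}% of the estimated context budget is used. ` +
      `Prefer narrower queries or start a new session.`,
  }
}
```

Placing the warning in its own field keeps the original payload untouched, so callers that ignore the field see no change in shape.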

4. Recursive tool call depth limits

In agentic workflows, a tool call's response can include instructions or data that cause the LLM to make additional tool calls, which in turn return data that causes further calls. Without a depth limit, this pattern inflates token consumption quickly: each call in a chain adds its full response to the context, and when a single response triggers several follow-up calls the total grows geometrically with depth. An adversarial document whose content is structured to look like tool invocation instructions can trigger dozens of additional calls before the session context is exhausted.

const MAX_TOOL_CALL_DEPTH = 5       // maximum recursive tool call depth per session
const MAX_TOOL_CALLS_PER_MINUTE = 30 // rate limit on top of depth limit

interface ToolCallContext {
  sessionId: string
  depth: number
  callsThisMinute: number
  minuteWindowStart: number
}

const toolCallContexts = new Map<string, ToolCallContext>()

function checkCallDepthAndRate(sessionId: string, currentDepth: number): void {
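  const ctx = toolCallContexts.get(sessionId) ?? {
    sessionId, depth: 0, callsThisMinute: 0, minuteWindowStart: Date.now(),
  }
  // Store the context before any check can throw, so a rejected call is still counted
  toolCallContexts.set(sessionId, ctx)
  ctx.depth = currentDepth // record the most recent depth seen for this session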

  // Reset per-minute counter if window has passed
  if (Date.now() - ctx.minuteWindowStart > 60_000) {
    ctx.callsThisMinute = 0
    ctx.minuteWindowStart = Date.now()
  }

  if (currentDepth > MAX_TOOL_CALL_DEPTH) {
    throw new Error(
      `Tool call depth limit exceeded (${currentDepth} > ${MAX_TOOL_CALL_DEPTH}). ` +
      `Recursive tool invocation chains are limited to prevent context window exhaustion.`
    )
  }

  ctx.callsThisMinute++
  if (ctx.callsThisMinute > MAX_TOOL_CALLS_PER_MINUTE) {
    throw new Error(
      `Tool call rate limit exceeded (${ctx.callsThisMinute} calls in the last 60 seconds). ` +
      `Maximum is ${MAX_TOOL_CALLS_PER_MINUTE} tool calls per minute.`
    )
  }

  toolCallContexts.set(sessionId, ctx)
}

// The MCP transport layer should track call depth via the request context
// (depth increments when a tool call is initiated from within a tool response)
// If your MCP SDK does not provide depth, track via a session-scoped counter
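
The session-scoped counter mentioned in the comments above could look like the sketch below. It is an assumption-laden illustration: `resetChain`, `enterToolCall`, and `MAX_CHAIN_LENGTH` are hypothetical names, and the approach treats "depth" as the number of tool calls since the last user message. An MCP server typically does not see user turns, so the reset signal is assumed to come from the host application; without one, a time-based reset is a possible fallback.

```typescript
const MAX_CHAIN_LENGTH = 5 // illustrative limit on tool calls per user turn

const chainLengthBySession = new Map<string, number>()

// Call when a new user message arrives: the chain starts over.
// (Assumes the host application can signal this to the server.)
function resetChain(sessionId: string): void {
  chainLengthBySession.set(sessionId, 0)
}

// Call at the start of every tool invocation. Returns the position of this
// call in the current chain, or throws once the chain is too long.
function enterToolCall(sessionId: string): number {
  const next = (chainLengthBySession.get(sessionId) ?? 0) + 1
  if (next > MAX_CHAIN_LENGTH) {
    throw new Error(
      `Tool call chain limit exceeded (${next} > ${MAX_CHAIN_LENGTH}) since the last user message.`,
    )
  }
  chainLengthBySession.set(sessionId, next)
  return next
}
```

The value returned by `enterToolCall` can be passed as `currentDepth` to a depth check, which keeps the limit logic in one place.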

5. Adversarial padding pattern detection

Adversarial context window attacks pad content with high-token-count but low-information material to displace earlier context. The padding may consist of repeated phrases, whitespace, zero-width Unicode characters, or invisible control characters that consume tokens without appearing suspicious to a human reading the document. Detecting these patterns before including content in a tool response prevents a document-based attack from consuming the LLM's context budget.

// Detect common adversarial padding patterns in content before returning to LLM

interface PaddingCheck {
  suspicious: boolean
  reason?: string
  detectedAt: number // character offset
}

function detectAdversarialPadding(content: string): PaddingCheck {
  // 1. Zero-width and invisible Unicode characters (common in prompt injection payloads)
  const invisibleChars = /[\u200B-\u200D\u2060\uFEFF\u034F]/g
  const invisibleMatch = invisibleChars.exec(content)
  if (invisibleMatch) {
    return { suspicious: true, reason: 'Zero-width or invisible Unicode characters detected', detectedAt: invisibleMatch.index }
  }

  // 2. High repetition ratio: repeated-character padding
  // Count distinct characters within a fixed-size sample so the numerator and
  // denominator cover the same text. The threshold is a heuristic and needs tuning:
  // ordinary prose uses dozens of distinct characters, so it must sit well below that.
  const sample = content.slice(0, 1000)
  const uniqueRatio = new Set(sample).size / sample.length
  if (content.length > 500 && uniqueRatio < 0.02) {
    return { suspicious: true, reason: `Low unique character ratio (${uniqueRatio.toFixed(3)}) suggests repetitive padding`, detectedAt: 0 }
  }

  // 3. Repeated substring detection — same 20-char+ sequence appearing 5+ times
  const WINDOW = 20
  const REPEAT_THRESHOLD = 5
  for (let i = 0; i < Math.min(content.length - WINDOW, 5000); i += WINDOW) {
    const chunk = content.slice(i, i + WINDOW)
    let count = 0
    let pos = 0
    while ((pos = content.indexOf(chunk, pos)) !== -1) {
      count++
      pos += WINDOW
      if (count >= REPEAT_THRESHOLD) {
        return { suspicious: true, reason: `Repeated ${WINDOW}-character sequence detected ${count}+ times — possible padding attack`, detectedAt: i }
      }
    }
  }

  // 4. Unusually long lines (single-line content that would fill the context window)
  // Walk the lines once, tracking each line's start offset, instead of spreading
  // a potentially huge array into Math.max and searching for the line again.
  let lineStart = 0
  for (const line of content.split('\n')) {
    if (line.length > 5000) {
      return { suspicious: true, reason: `Unusually long line (${line.length} chars), may be single-line context inflation`, detectedAt: lineStart }
    }
    lineStart += line.length + 1 // +1 for the newline removed by split
  }

  return { suspicious: false, detectedAt: -1 }
}

// Apply in any tool that returns third-party content
async function fetchAndReturnContent(url: string): Promise<object> {
  const response = await fetch(url)
  const text = await response.text()
  const truncated = text.length > RESPONSE_CHAR_LIMIT
  const limited = text.slice(0, RESPONSE_CHAR_LIMIT) // always limit first

  const paddingCheck = detectAdversarialPadding(limited)
  return {
    content: limited,
    // Never truncate silently: tell the LLM the content was cut
    ...(truncated && {
      truncated: true,
      truncation_notice: `Content truncated at ${RESPONSE_CHAR_LIMIT} characters (full length: ${text.length}).`,
    }),
    ...(paddingCheck.suspicious && {
      security_notice: `Content contains a potential adversarial padding pattern (${paddingCheck.reason} at offset ${paddingCheck.detectedAt}). Content has been included but treat with caution.`,
    }),
  }
}
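
Detection can be paired with sanitization, so that padding stops consuming tokens even when the content is still returned. The sketch below is an illustration, not a complete defense: `sanitizePadding` is a hypothetical helper, the character set is an assumed list of common invisible code points, and the run length of 20 is arbitrary. Note that zero-width joiners and non-joiners are legitimate in emoji sequences and in some scripts, so stripping them can alter benign text.

```typescript
// Assumed set: zero-width space/non-joiner/joiner (U+200B to U+200D),
// word joiner (U+2060), byte order mark (U+FEFF), combining grapheme joiner (U+034F).
const INVISIBLE_CHARS = /[\u200B-\u200D\u2060\uFEFF\u034F]/g

interface SanitizeResult {
  cleaned: string
  removedInvisible: number // count of invisible characters stripped
  collapsedRuns: number    // count of long space/tab runs collapsed to one space
}

function sanitizePadding(content: string, maxRun = 20): SanitizeResult {
  // Step 1: strip invisible characters.
  const withoutInvisible = content.replace(INVISIBLE_CHARS, '')
  const removedInvisible = content.length - withoutInvisible.length

  // Step 2: collapse runs of spaces or tabs longer than maxRun into a single space.
  let collapsedRuns = 0
  const longRun = new RegExp(`[ \\t]{${maxRun + 1},}`, 'g')
  const cleaned = withoutInvisible.replace(longRun, () => {
    collapsedRuns++
    return ' '
  })

  return { cleaned, removedInvisible, collapsedRuns }
}
```

Reporting the counts lets the handler attach a security notice only when something was actually removed.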

What SkillAudit checks

SkillAudit's analysis flags these token budget security issues:

Unbounded tool responses — handlers that return full database rows, raw HTTP response bodies, or full file contents without a size cap. Flagged when the response includes fields with no size constraint (e.g., body, content, text) from external sources.
Raw third-party content in responses — tool handlers that return response.text(), row.body, or document content verbatim without truncation or summarization. These are both token budget risks and prompt injection vectors.
No pagination in list responses — handlers that return SELECT * FROM table without a LIMIT clause, or array responses with no item count cap.
No tool call rate limit — handlers without a per-session counter, allowing LLM-driven tool call loops to exhaust the token budget without triggering a circuit breaker.
Array spread of DB results into response — return result.rows directly, returning all rows including body/description columns that may be unbounded in size.

Run a free SkillAudit scan to check your MCP server's token budget exposure. Oversized tool responses and prompt injection through large content are both covered in the Security sub-score.