Resilience Engineering
Designing for resiliency: how to build MCP servers that fail securely
Failure in an MCP server is not just an uptime problem — it is a security event. When an upstream API returns 503, the machine-speed LLM orchestrator doesn't wait; it retries, branches, and probes. Without deliberate failure design, partial state, ambient credentials, and exploitable retry paths become the attack surface. Circuit breakers, timeout budgets, fail-safe defaults, and clean error hygiene are not operational polish — they are security controls.
Published 2026-06-05 · 12 min read
The MCP failure model differs from conventional API failure in three ways that most resilience literature doesn't address. First, the caller is an LLM orchestrator — it can issue hundreds of parallel tool calls per second, meaning a retry storm is not a human clicking "retry" but an automated cascade that spins up before any human alert fires. Second, MCP servers typically hold ambient credentials scoped to the entire session: a database connection, a third-party API key, a filesystem mount. When the server enters an inconsistent state, those credentials are live for the full duration of the failure. Third, the LLM sees error messages as feedback that it can reason about. An uncontrolled error that leaks a stack trace, a file path, or a database schema fragment is not just noise — it is reconnaissance material returned directly to the model that controls subsequent tool calls.
Resilient failure design closes each of these gaps. It is also one of the factors SkillAudit evaluates under the Maintenance axis, where servers that lack timeout handling, circuit breaker patterns, or structured error responses receive findings. This post covers the patterns we look for — and what we flag when they are absent.
The anatomy of a failure cascade in an MCP server
Before the patterns, it helps to understand how failure propagates in an MCP context. A typical failure cascade starts with a slow upstream: the third-party API your server calls begins returning 503s or taking 30+ seconds to respond. In a conventional server, slow upstreams cause latency; under load they cause queuing; under heavy load they cause memory exhaustion. In an MCP server these still happen, but two additional effects compound the damage:
Retry amplification. The LLM orchestrator receives a timeout or error response and, per its instructions or reasoning, retries the tool call. Without a circuit breaker on the server side, each retry hits the slow upstream again. If the model runs a multi-step workflow that depends on this tool, it may issue tool calls at each branch point — the retry count multiplies with the branching factor of the task. A single failing upstream call can generate dozens of requests before any human-visible symptom appears.
Partial state exposure. If the server enters a failure mid-operation — after one write but before a compensating write — the state it exposes to subsequent calls may be inconsistent. For non-idempotent operations (credit transfers, file writes, permission grants), the inconsistent state is both a functional bug and a potential security issue: a partially applied permission grant that is exploitable, a partially completed file write that truncates a config file and breaks subsequent access controls.
The machine-speed problem. A human user clicking "try again" on a slow form generates one retry every few seconds at most — easily absorbed by any backoff. An LLM orchestrator completing a workflow step generates retries within milliseconds of receiving each error response, and a multi-agent workflow can coordinate dozens of simultaneous retry attempts. Resilience patterns that work for human-paced traffic fail catastrophically at machine speed unless they include server-side enforcement — timeout budgets, circuit breakers, queue depth limits — not just retry guidelines in the model's system prompt.
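To make the retry amplification described above concrete, here is a back-of-the-envelope sketch. The three factors and their values are hypothetical, chosen only to show how the layers multiply; real workflows will have different shapes.

```typescript
// Hypothetical model: every factor below is an illustrative assumption.
interface RetryFanout {
  branchPoints: number;         // places in the workflow that call the failing tool
  orchestratorAttempts: number; // attempts the model makes per branch point
  serverAttempts: number;       // attempts the server makes per tool call
}

// Worst case: every attempt at every layer reaches the failing upstream.
function worstCaseUpstreamRequests(f: RetryFanout): number {
  return f.branchPoints * f.orchestratorAttempts * f.serverAttempts;
}

const requests = worstCaseUpstreamRequests({
  branchPoints: 4,
  orchestratorAttempts: 3,
  serverAttempts: 3,
});
// requests === 36
```

Under these assumed numbers, one failing upstream absorbs 36 requests from a single workflow, which is why the limits need to live on the server rather than in the prompt.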
Pattern 1: Timeout budgets — every await needs a ceiling
The most common resilience finding in SkillAudit reports is an MCP tool handler that awaits an upstream call with no explicit timeout. In Node.js, a fetch() or database query that stalls leaves the handler's promise pending, along with the socket and memory it holds, for as long as the underlying client allows. Under normal conditions this is invisible because upstreams respond quickly. Under failure conditions, a single slow call can hold those resources until the process is killed or the OS closes the connection.
// Vulnerable: no timeout, so a stalled upstream stalls the tool call.
export async function get_user_data(
  args: { userId: string }
) {
  const resp = await fetch(
    `https://api.service.com/users/${args.userId}`,
    { headers: { Authorization: apiKey } }
  );
  return resp.json();
}

// Fixed: every upstream call is bounded by an AbortController timeout.
const TOOL_TIMEOUT_MS = 8_000;

export async function get_user_data(
  args: { userId: string }
) {
  const ctrl = new AbortController();
  const tid = setTimeout(
    () => ctrl.abort(),
    TOOL_TIMEOUT_MS
  );
  try {
    const resp = await fetch(
      // Encode the ID so it cannot alter the request path
      `https://api.service.com/users/${encodeURIComponent(args.userId)}`,
      {
        headers: { Authorization: apiKey },
        signal: ctrl.signal
      }
    );
    // Await the body inside the try block so the timer also covers
    // reading the response, not only receiving the headers.
    return await resp.json();
  } finally {
    clearTimeout(tid);
  }
}
The timeout budget principle goes further than per-call timeouts. A production-grade MCP server enforces three timeout layers:
- Per-call timeout — hard ceiling on any single upstream request (typically 5–15 seconds depending on the API).
- Per-tool-invocation budget — the total time allowed for one tool call, including retries. If a tool allows one retry, the per-invocation budget is less than 2× the per-call timeout, not open-ended.
- Per-session quota — total cumulative time or request count allowed from one session. Prevents a single runaway workflow from monopolizing upstream capacity.
Session-level quotas are especially important for tools that can be called in a loop. If the model loops a file-scanning tool across 10,000 files and each call takes 1 second, the session runs for nearly 3 hours against your upstream API before hitting any natural limit. A session quota on total tool invocations or total upstream time bounds the blast radius.
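The second and third layers can be sketched as two small classes. This is a minimal illustration rather than a production implementation: the names InvocationBudget and SessionQuota are hypothetical, and the clock is injected so the logic can be exercised without real waiting.

```typescript
// Injectable clock, in milliseconds.
type Clock = () => number;

// Per-tool-invocation budget: one deadline shared by the first call and any retries.
class InvocationBudget {
  private readonly deadline: number;
  private readonly now: Clock;

  constructor(budgetMs: number, now: Clock = Date.now) {
    this.now = now;
    this.deadline = now() + budgetMs;
  }

  // Timeout to use for the next upstream call: the per-call ceiling,
  // shrunk to whatever is left of the invocation budget.
  nextCallTimeout(perCallMs: number): number {
    const remaining = this.deadline - this.now();
    if (remaining <= 0) {
      throw new Error('Tool invocation budget exhausted');
    }
    return Math.min(perCallMs, remaining);
  }
}

// Per-session quota: a hard cap on tool invocations from one session.
class SessionQuota {
  private used = 0;
  private readonly maxInvocations: number;

  constructor(maxInvocations: number) {
    this.maxInvocations = maxInvocations;
  }

  // Call once at the start of every tool invocation.
  consume(): void {
    if (this.used >= this.maxInvocations) {
      throw new Error('Session quota exceeded');
    }
    this.used++;
  }
}
```

With an assumed 12-second budget and an 8-second per-call ceiling, the first call gets 8 seconds and a retry that starts at the 8-second mark gets only the remaining 4. The value returned by nextCallTimeout would be passed to the AbortController timer for that call.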
Pattern 2: Circuit breakers — stop calling broken upstreams
A timeout budget stops one call from hanging indefinitely. A circuit breaker stops the server from repeatedly calling an upstream that is clearly failing — converting retry storms into fast-fails that protect both the upstream and the server's own resources.
The circuit breaker has three states: closed (normal operation — calls pass through), open (upstream is failing — calls fast-fail immediately without hitting the upstream), and half-open (recovery probe — a single call is allowed through to test whether the upstream has recovered). The transition from closed to open happens when the failure rate exceeds a threshold over a rolling window. The transition from open to half-open happens after a cooldown period.
class CircuitBreaker {
  // Consecutive failures since the last success. This is a simplification of
  // the rolling-window failure rate described above, and concurrent callers
  // are not limited to a single probe while the circuit is half-open.
  private failures = 0;
  private lastFailureAt = 0;
  private state: 'closed' | 'open' | 'half-open' = 'closed';
  private readonly threshold: number;
  private readonly cooldownMs: number;

  constructor(threshold = 5, cooldownMs = 30_000) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
  }

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'open') {
      const elapsed = Date.now() - this.lastFailureAt;
      if (elapsed < this.cooldownMs) {
        // Fast-fail without touching the upstream
        throw new Error('Service temporarily unavailable: circuit open');
      }
      // Cooldown elapsed: allow a recovery probe
      this.state = 'half-open';
    }
    try {
      const result = await fn();
      // Any success clears the count and closes the circuit, so failures
      // from long ago cannot accumulate and trip the breaker.
      this.failures = 0;
      this.state = 'closed';
      return result;
    } catch (err) {
      this.failures++;
      this.lastFailureAt = Date.now();
      if (this.failures >= this.threshold) {
        this.state = 'open';
      }
      throw err;
    }
  }
}

// Usage: one breaker instance per upstream dependency
const breaker = new CircuitBreaker(5, 30_000);

export async function call_external_api(args: { query: string }) {
  return breaker.call(async () => {
    const ctrl = new AbortController();
    const tid = setTimeout(() => ctrl.abort(), 8_000);
    try {
      const resp = await fetch(upstreamUrl, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ q: args.query }),
        signal: ctrl.signal
      });
      if (!resp.ok) throw new Error(`upstream ${resp.status}`);
      // Await the body here so the timeout also covers reading it
      return await resp.json();
    } finally {
      clearTimeout(tid);
    }
  });
}
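The class above trips on a failure count. The rolling-window failure rate mentioned earlier could be tracked with something like the following sketch. RollingFailureRate, its default window, and its minimum sample size are all hypothetical choices for illustration; the clock is injected so the window can be tested deterministically.

```typescript
// Tracks call outcomes inside a sliding time window.
class RollingFailureRate {
  private outcomes: { at: number; ok: boolean }[] = [];
  private readonly windowMs: number;
  private readonly minSamples: number;
  private readonly now: () => number;

  constructor(windowMs = 10_000, minSamples = 10, now: () => number = Date.now) {
    this.windowMs = windowMs;
    this.minSamples = minSamples;
    this.now = now;
  }

  // Record the outcome of one upstream call.
  record(ok: boolean): void {
    const t = this.now();
    this.outcomes.push({ at: t, ok });
    this.prune(t);
  }

  // Failure rate in [0, 1]. Reports 0 until minSamples outcomes are in the
  // window, so one early failure does not read as a 100% failure rate.
  rate(): number {
    this.prune(this.now());
    if (this.outcomes.length < this.minSamples) return 0;
    const failed = this.outcomes.filter(o => !o.ok).length;
    return failed / this.outcomes.length;
  }

  shouldTrip(threshold = 0.5): boolean {
    return this.rate() >= threshold;
  }

  // Drop outcomes that have aged out of the window.
  private prune(t: number): void {
    const cutoff = t - this.windowMs;
    this.outcomes = this.outcomes.filter(o => o.at > cutoff);
  }
}
```

A breaker built on this would call record() after every attempt and open the circuit when shouldTrip() returns true. The minimum sample size is a design choice that trades slower detection for fewer false trips on low-traffic dependencies.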
The circuit breaker's fast-fail response is critical. When the circuit is open, the tool returns an error immediately — without a network round-trip. The LLM orchestrator receives a clear error message that it can handle in its current reasoning context (retry later, branch to a fallback tool, report the service is down) rather than waiting 8 seconds per call while the orchestrator floods a broken upstream.
One circuit breaker per dependency, not one per server. If your MCP server calls three external services, you need three independent circuit breakers. A failure in service A should not trip the circuit for service B. Sharing a single breaker across unrelated upstreams propagates failures that should remain isolated.
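One way to keep breakers isolated is a small registry that hands out exactly one instance per dependency name. The sketch below is generic over the instance type; PerDependency and the dependency names are hypothetical, and FailureCounter is only a stand-in for a real breaker class.

```typescript
// Lazily creates one instance per dependency name and always returns the same one.
class PerDependency<T> {
  private readonly instances = new Map<string, T>();
  private readonly create: (name: string) => T;

  constructor(create: (name: string) => T) {
    this.create = create;
  }

  forDependency(name: string): T {
    let instance = this.instances.get(name);
    if (instance === undefined) {
      instance = this.create(name);
      this.instances.set(name, instance);
    }
    return instance;
  }
}

// Stand-in for a circuit breaker; it only counts failures here.
class FailureCounter {
  failures = 0;
}

const breakers = new PerDependency(() => new FailureCounter());
breakers.forDependency('billing-api').failures++; // hypothetical dependency names
// breakers.forDependency('search-api').failures is still 0
```

Because state is keyed by dependency, a burst of failures against one upstream never changes what callers of another upstream see.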
Pattern 3: Exponential backoff with jitter — taming retry storms
When a tool call fails transiently (network blip, upstream restart, 429 rate limit), retrying is correct. The problem is when all retry logic uses a fixed delay or no delay at all. If a hundred LLM tool calls all fail simultaneously and all retry at the same interval, the retry wave hits the upstream at exactly the wrong moment — potentially extending the outage or triggering rate limiting that wouldn't have occurred otherwise.
Exponential backoff with full jitter distributes retries across a time window, reducing the probability that retries synchronize into a wave:
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 200,
  maxDelayMs = 5_000
): Promise<T> {
  let attempt = 0;
  while (true) {
    try {
      return await fn();
    } catch (err) {
      attempt++;
      if (attempt >= maxAttempts) throw err;
      // Full jitter: uniform random in [0, min(maxDelay, base * 2^(attempt - 1))),
      // so with the defaults the ceilings run 200ms, 400ms, 800ms, ... up to maxDelayMs
      const cap = Math.min(maxDelayMs, baseDelayMs * 2 ** (attempt - 1));
      const delay = Math.random() * cap;
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
}

// Combine with circuit breaker: retry inside the breaker call
export async function search_index(args: { query: string }) {
  return searchBreaker.call(() =>
    withRetry(() => fetchFromIndex(args.query), 3, 200, 4_000)
  );
}
The key properties: the ceiling on the delay doubles with each failed attempt, giving the upstream time to recover; the jitter draws the actual delay uniformly from below that ceiling, so simultaneous retries spread across the window rather than synchronizing; the cap prevents the delay from growing unbounded; and the attempt limit prevents infinite retry loops. Compose this with the circuit breaker so that repeated failures across calls eventually trip the breaker rather than retrying indefinitely.
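The delay schedule is easier to reason about when the ceiling and the jitter are separate pure functions. The sketch below counts retries from 1, so the first ceiling equals the base delay; the function names are hypothetical, and the random source is injectable purely so the behavior can be checked deterministically.

```typescript
// Upper bound on the delay before retry number `retry` (1-based).
function backoffCeilingMs(
  retry: number,
  baseDelayMs: number,
  maxDelayMs: number
): number {
  return Math.min(maxDelayMs, baseDelayMs * 2 ** (retry - 1));
}

// Full jitter: a uniform draw from [0, ceiling).
function fullJitterDelayMs(
  retry: number,
  baseDelayMs: number,
  maxDelayMs: number,
  random: () => number = Math.random
): number {
  return random() * backoffCeilingMs(retry, baseDelayMs, maxDelayMs);
}

const ceilings = [1, 2, 3, 4, 5, 6].map(r => backoffCeilingMs(r, 200, 5_000));
// ceilings: [200, 400, 800, 1600, 3200, 5000]
```

The sixth ceiling would be 6400ms without the cap, which is where maxDelayMs takes over. Any individual delay falls anywhere below its ceiling, which is what breaks up synchronized retry waves.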
Pattern 4: Fail-safe defaults — what state to be in when uncertain
When an MCP server cannot determine the correct state — because an upstream is down, a response is malformed, or an authorization check fails to complete — it must choose a default posture. The fail-safe principle says: default to the more restrictive option when uncertain.
This sounds obvious but requires explicit implementation. Many handlers implicitly fail-open: they catch a timeout, log the error, and return an empty result or a cached result from a previous call. If the cached result is stale authorization data, this means a revoked permission appears active until the cache expires. If the empty result causes a downstream branch to skip a required access check, the entire access control fails open.
// Fail-open (vulnerable): an auth outage turns into a silent grant.
async function isAuthorized(
  userId: string,
  resource: string
): Promise<boolean> {
  try {
    const resp = await authBreaker.call(
      () => fetchAuthz(userId, resource)
    );
    return resp.allowed;
  } catch {
    // Upstream down: return cached or default true
    // so the user isn't blocked
    return cachedAuthz.get(userId) ?? true;
  }
}

// Fail-closed (secure): an auth outage turns into a visible, temporary error.
async function isAuthorized(
  userId: string,
  resource: string
): Promise<boolean> {
  try {
    const resp = await authBreaker.call(
      () => fetchAuthz(userId, resource)
    );
    return resp.allowed;
  } catch (err) {
    // Authorization service unavailable: deny.
    // Better to surface a temporary error than
    // to grant access we can't verify
    throw new McpError(
      ErrorCode.InternalError,
      'Authorization service temporarily unavailable'
    );
  }
}
The fail-closed pattern forces the tool call to surface an error that the LLM can reason about ("try again later" or "escalate to a human") rather than silently operating on stale or default-permissive data. In the case of authorization, a temporary denial is recoverable — the user gets an error and can retry when the auth service comes back. A silent grant of unauthorized access is not recoverable if the user acts on that access before the error is discovered.
The stale cache antipattern. Many servers use caching to absorb upstream failures: if the cache has a value, return it; if not, fetch and cache. Under a prolonged upstream outage (minutes to hours), this degrades gracefully — until the cache values are from before a permission revocation or role change. A user whose access was revoked continues operating with full permissions for the cache TTL duration. For authorization data specifically, prefer fail-closed over stale-cache. For non-sensitive data (public product catalog, static reference tables), stale cache is an appropriate degradation strategy.
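The distinction between authorization data and non-sensitive data can be made explicit in code rather than left to each handler. The sketch below is one hypothetical way to do that: the Sensitivity labels, the CacheEntry shape, and the staleFallback function are illustrative assumptions, and the current time is passed in so the policy is easy to test.

```typescript
type Sensitivity = 'authorization' | 'reference';

interface CacheEntry<T> {
  value: T;
  storedAt: number; // epoch milliseconds when the value was cached
}

// Called only when the upstream is unavailable. Returns the cached value if
// the policy allows serving it stale; otherwise throws so the tool fails visibly.
function staleFallback<T>(
  entry: CacheEntry<T> | undefined,
  sensitivity: Sensitivity,
  maxStaleMs: number,
  now: number
): T {
  if (sensitivity === 'authorization') {
    // Fail closed: never answer an access question from stale data.
    throw new Error('Authorization service temporarily unavailable');
  }
  if (entry === undefined || now - entry.storedAt > maxStaleMs) {
    throw new Error('Upstream unavailable and no sufficiently fresh cached value');
  }
  return entry.value;
}
```

Putting the sensitivity in the function signature forces every call site to state which kind of data it is handling, which makes a fail-open path harder to introduce by accident.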
Pattern 5: Error message hygiene — what not to return
Error messages returned by an MCP tool handler are visible to the LLM orchestrator. The model processes them, reasons about them, and may include them in subsequent tool call arguments. An error message that leaks a file path, a database connection string, a table name, or an internal service name gives the model — and any prompt injection payload it may have processed — a reconnaissance path into your infrastructure.
The principle: distinguish between internal errors (log the full detail; return a safe summary) and expected user errors (return the full reason — it's actionable and not sensitive).
import * as path from 'node:path';
import * as crypto from 'node:crypto';
import { promises as fs } from 'node:fs';
// McpError and ErrorCode come from the MCP TypeScript SDK; log is the server's logger.
// ALLOWED_ROOT is assumed to be an absolute, normalized directory path.

class SafeError extends Error {
  constructor(
    public readonly code: string,
    public readonly userMessage: string,
    public readonly internalDetail: unknown
  ) {
    super(userMessage);
  }
}

function handleToolError(err: unknown): never {
  if (err instanceof SafeError) {
    // Expected error: safe to return full message
    log.info({ code: err.code, detail: err.internalDetail });
    throw new McpError(ErrorCode.InvalidRequest, err.userMessage);
  }
  // Unexpected error: log full detail internally, return safe summary
  const errorId = crypto.randomUUID().slice(0, 8);
  log.error({ errorId, err }, 'Unexpected tool error');
  throw new McpError(
    ErrorCode.InternalError,
    `Internal error [ref: ${errorId}]. The operation did not complete.`
  );
}

export async function read_file(args: { path: string }) {
  try {
    const resolved = path.resolve(ALLOWED_ROOT, args.path);
    // Accept only the root itself or paths under root + separator. A bare
    // startsWith(ALLOWED_ROOT) would also accept a sibling directory whose
    // name merely begins with the root's name.
    const inside =
      resolved === ALLOWED_ROOT ||
      resolved.startsWith(ALLOWED_ROOT + path.sep);
    if (!inside) {
      throw new SafeError(
        'PATH_TRAVERSAL',
        'Path is outside the allowed directory',
        { attempted: args.path, resolved }
      );
    }
    return await fs.readFile(resolved, 'utf8');
  } catch (err) {
    handleToolError(err);
  }
}
The error ID pattern is particularly useful in production: the internal log captures the full stack trace and context keyed to a short reference ID, while the LLM-visible error contains only the ID. If the model includes the error ID in a user-facing summary, operations can look it up in the log. The model cannot use the error to reconstruct internal paths or schemas.
What to never include in LLM-visible error messages: file system paths (especially absolute paths), database query strings or table names, upstream API URLs with embedded credentials, process IDs or host names, dependency version strings, or stack traces. All of these are reconnaissance material.
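As defense in depth, messages that are meant to be safe can still be passed through a scrubber before they reach the model. The sketch below is illustrative only: the patterns are assumptions, far from exhaustive, and deliberately err toward over-redaction. It does not replace the internal/expected split shown earlier.

```typescript
// Ordered redaction rules. Patterns are illustrative, not exhaustive.
const REDACTIONS: [RegExp, string][] = [
  // Stack frames such as "    at fn (/srv/app/x.js:10:5)"
  [/^\s*at .+$/gm, ''],
  // URLs, which may embed credentials or internal host names
  [/\b[a-z][a-z0-9+.-]*:\/\/\S+/gi, '[url]'],
  // Absolute POSIX paths and Windows drive paths
  [/(?:[A-Za-z]:\\|\/)[\w.-]+(?:[\\\/][\w.-]+)+/g, '[path]'],
];

function redactForModel(message: string): string {
  let out = message;
  for (const [pattern, replacement] of REDACTIONS) {
    out = out.replace(pattern, replacement);
  }
  // Collapse the blank lines left behind by removed stack frames.
  return out.replace(/\n{2,}/g, '\n').trim();
}

const redacted = redactForModel(
  "ENOENT: no such file, open '/srv/app/config/secrets.json'"
);
// redacted: "ENOENT: no such file, open '[path]'"
```

Pattern-based scrubbing will miss formats it was not written for, so it works best as a last check on messages that were already intended to be safe.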
Pattern 6: Idempotency — making retries safe
For mutating tool calls (writes, sends, payments, permission grants), retries are only safe if the operation is idempotent, meaning that applying it again produces the same result as applying it once. Without idempotency, a retry after a network timeout (where the server completed the operation but the response was lost) produces a duplicate mutation: a payment charged twice, a notification sent twice, a record appended twice.
The standard approach is an idempotency key: a caller-supplied ID that the server uses to deduplicate. If the server sees a request with an ID it has already processed, it returns the cached response without re-executing the mutation.
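// In-memory store for illustration; a multi-instance server needs a shared store.
const idempotencyStore = new Map<string, Promise<unknown>>();
const IDEMPOTENCY_TTL_MS = 86_400_000; // 24 hours

export async function send_notification(args: {
  recipientId: string;
  message: string;
  idempotencyKey: string; // Required for mutating tools
}) {
  // Validate that the key is a well-formed UUID v4
  const uuidV4 =
    /^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$/i;
  if (!uuidV4.test(args.idempotencyKey)) {
    throw new SafeError(
      'INVALID_KEY',
      'idempotencyKey must be a UUID v4',
      { provided: args.idempotencyKey }
    );
  }
  const existing = idempotencyStore.get(args.idempotencyKey);
  if (existing !== undefined) {
    return existing; // Replay the stored (or still in-flight) result
  }
  // Store the promise before awaiting it, so a concurrent retry with the
  // same key joins this delivery instead of starting a second one.
  const pending = deliverNotification(
    args.recipientId,
    args.message
  );
  idempotencyStore.set(args.idempotencyKey, pending);
  try {
    const result = await pending;
    // Expire after 24 hours to bound memory usage
    setTimeout(
      () => idempotencyStore.delete(args.idempotencyKey),
      IDEMPOTENCY_TTL_MS
    );
    return result;
  } catch (err) {
    // Delivery failed: forget the key so a retry can run the operation again
    idempotencyStore.delete(args.idempotencyKey);
    throw err;
  }
}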
The idempotency key should be generated by the LLM orchestrator or the system prompt logic that generates tool arguments — not by the MCP server itself. A server-generated key would change on each call, defeating the purpose. For LLM-initiated tool calls, the system prompt can instruct the model to generate a UUID per unique intended operation and reuse it across retries of the same operation.
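On the calling side, the important detail is that the key is created once per intended operation and then reused for every retry. The wrapper below is a hypothetical sketch of that rule: callWithStableKey is not part of any SDK, and the key generator is passed in so the example does not assume a particular UUID source.

```typescript
// Runs `attempt` up to `maxAttempts` times, passing the SAME key every time.
async function callWithStableKey<T>(
  newKey: () => string,
  attempt: (idempotencyKey: string) => Promise<T>,
  maxAttempts = 3
): Promise<T> {
  const key = newKey(); // generated once per intended operation
  let lastError: unknown;
  for (let i = 0; i < maxAttempts; i++) {
    try {
      return await attempt(key);
    } catch (err) {
      lastError = err; // a real caller would back off here before retrying
    }
  }
  throw lastError;
}
```

If the first attempt succeeded on the server but its response was lost, the second attempt arrives with the same key and receives the stored result instead of triggering a second mutation.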
Pattern 7: Graceful degradation — partial responses over total failures
When an MCP server aggregates data from multiple sources, a single failing source should not block the entire response. A tool that fetches metadata from three services and one is down should return the two successful responses plus a clear partial-failure indicator — not a total error that forces the LLM to retry the entire tool call.
export async function get_user_context(args: { userId: string }) {
  const results = await Promise.allSettled([
    fetchProfile(args.userId),         // core: must succeed
    fetchPreferences(args.userId),     // supplementary
    fetchActivitySummary(args.userId)  // supplementary
  ]);
  const [profileResult, prefsResult, activityResult] = results;

  // Core data must succeed: fail the tool if profile is unavailable
  if (profileResult.status === 'rejected') {
    throw new SafeError(
      'PROFILE_UNAVAILABLE',
      'User profile service is temporarily unavailable',
      profileResult.reason
    );
  }

  return {
    profile: profileResult.value,
    preferences: prefsResult.status === 'fulfilled'
      ? prefsResult.value
      : null,
    activitySummary: activityResult.status === 'fulfilled'
      ? activityResult.value
      : null,
    _partial: [
      prefsResult.status === 'rejected' && 'preferences',
      activityResult.status === 'rejected' && 'activitySummary'
    ].filter(Boolean)
  };
}
The _partial field signals to the LLM orchestrator which data fields are missing due to upstream failures. A well-designed system prompt instructs the model to note partial results in its response and suggest that the user check back. The alternative — returning a total error — forces a full retry that may also fail on the same unavailable supplementary services.
Apply the core-vs-supplementary distinction deliberately. If the tool produces no useful result without all data sources, it should fail hard. If it produces meaningful partial results, graceful degradation is appropriate — but the partial flag must be clearly communicated, not hidden in default values that look complete.
Pattern 8: Resource limits — connection pools, queue depth, and rate gates
MCP servers that hold persistent resource pools (database connections, WebSocket channels, HTTP keep-alive pools) need explicit bounds on those pools. Without bounds, a traffic spike — or an LLM orchestrator running a parallel workflow — can exhaust the pool, causing every subsequent call to queue indefinitely or fail with an opaque resource-exhaustion error.
import { Pool } from 'pg';

// Named constants, not magic numbers buried in initialization
const DB_POOL_MAX = 10;
const DB_POOL_IDLE_TIMEOUT_MS = 30_000;
const DB_ACQUIRE_TIMEOUT_MS = 5_000;

const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: DB_POOL_MAX,
  idleTimeoutMillis: DB_POOL_IDLE_TIMEOUT_MS,
  connectionTimeoutMillis: DB_ACQUIRE_TIMEOUT_MS
});

// Rate gate: bounded concurrency across all tool invocations
let activeCalls = 0;
const MAX_CONCURRENT = 20;

// Illustration of resource limits only: running model-supplied SQL verbatim
// is unsafe, and a real tool would expose specific parameterized queries.
export async function db_query(args: { sql: string }) {
  if (activeCalls >= MAX_CONCURRENT) {
    throw new McpError(
      ErrorCode.InternalError,
      'Server is at capacity, retry in a moment'
    );
  }
  activeCalls++;
  try {
    const client = await pool.connect();
    try {
      const result = await client.query(args.sql);
      return result.rows;
    } finally {
      client.release();
    }
  } finally {
    activeCalls--;
  }
}
The connection timeout (connectionTimeoutMillis) is critical: without it, a call waiting for a pool connection will wait indefinitely if all connections are busy. The idle timeout prevents connections from accumulating against the database server's max connection limit when the MCP server is lightly loaded.
The concurrent-call gate is a secondary protection. With the values above, up to 20 calls are admitted at once: 10 can hold connections while the rest wait up to 5 seconds for one. Once 20 calls are in flight, new calls are rejected immediately with a clear message rather than queuing and eventually timing out with no useful error. The LLM receives an actionable error ("retry in a moment") rather than a silent hang.
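The counter in db_query can be generalized into a reusable wrapper so that every tool shares the same admission logic. ConcurrencyGate below is a hypothetical sketch: it never queues, so callers either get a slot immediately or receive an error they can act on.

```typescript
// Non-blocking gate: a caller either gets a slot now or is told to back off.
class ConcurrencyGate {
  private active = 0;
  private readonly limit: number;

  constructor(limit: number) {
    this.limit = limit;
  }

  // Number of calls currently holding a slot.
  get inFlight(): number {
    return this.active;
  }

  // Runs `fn` if a slot is free; otherwise rejects without queuing.
  async run<T>(fn: () => Promise<T>): Promise<T> {
    if (this.active >= this.limit) {
      throw new Error('Server is at capacity, retry in a moment');
    }
    this.active++;
    try {
      return await fn();
    } finally {
      this.active--; // always release the slot, even when fn throws
    }
  }
}
```

A handler would wrap its body in gate.run(...). Because the slot is released in a finally block, a failing call cannot leak capacity.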
What SkillAudit flags
The resilience patterns above map to specific SkillAudit findings across two axes, Maintenance and Security, with severity levels reflecting how directly each gap creates a security risk:
- No per-call timeout on upstream fetch/query: MEDIUM, Maintenance. Unbounded await creates resource exhaustion and denial-of-service risk from machine-speed retry amplification.
- No circuit breaker on external API dependencies: MEDIUM, Maintenance. Absent circuit breakers allow retry storms to amplify upstream failures into full service outages.
- Fail-open authorization check: HIGH, Security. Catching auth service failures and defaulting to true or a stale-permissive cache result is a direct access-control bypass.
- Stack trace or internal path in error response: MEDIUM, Security. Leaking infrastructure details to the LLM context creates a reconnaissance vector for prompt injection payloads that observe error outputs.
- No retry backoff (fixed delay or no delay): LOW, Maintenance. Synchronized retries contribute to retry storms; low severity because the primary protection is server-side circuit breaking.
- Unbounded resource pool: LOW, Maintenance. Without pool limits, a traffic spike can exhaust OS file descriptors or database server connection limits, causing process-level crashes rather than graceful degradation.
Resilience and the Maintenance axis score. A server that implements all eight patterns in this post scores clean on the Maintenance axis — 100 points before any other findings. Maintenance has a 12% weight in the overall grade, so a clean Maintenance axis contributes 12 points to the weighted total. Combined with clean Security and Credentials axes, this is typically the difference between a B+ and an A.
The resiliency checklist
Before shipping an MCP server, verify:
- Every upstream network call (fetch, database query, cache read) has an explicit timeout via AbortController or a driver-level timeout option.
- Each external dependency has its own circuit breaker instance (not a shared breaker across unrelated services).
- Retries use exponential backoff with full jitter, not a fixed delay or immediate retry.
- Authorization checks fail closed: a failure to reach the auth service returns an error, never a default-allow.
- Mutating tools (writes, sends, grants) accept and enforce an idempotency key.
- Error messages returned to the LLM contain no file paths, table names, stack traces, or internal URLs.
- Database connection pools have explicit max, idleTimeoutMillis, and connectionTimeoutMillis set.
- Multi-source aggregation tools use Promise.allSettled and distinguish core (must-succeed) from supplementary (can-degrade) data sources.
These are not heroic engineering efforts — they are table stakes for any service that holds real credentials and is called at machine speed. An MCP server that passes SkillAudit's Security and Credentials checks but fails on Maintenance is a server that is secure under normal conditions and a liability under failure conditions. Failure conditions, by definition, arrive without notice and without the human review cycle that might otherwise catch an access control gap before it is exploited.
If you want to see how your server handles failure modes before they occur in production, run a SkillAudit scan. The Maintenance axis report enumerates each missing resilience pattern, flags the specific lines where timeouts are absent, and suggests the minimal fix for each finding. A server that fails securely is a server that degrades without creating new attack surface — and that is an auditable property, not just an aspiration.
Related reading: MCP server incident response playbook — what to do in the 60 minutes after a security event is confirmed. Understanding grade drift — how a server's security posture changes over time without code changes. GitHub Action MCP security gate — running SkillAudit in CI so resilience regressions are caught before merge. MCP server security checklist — the complete pre-publish checklist across all six audit axes.