Topic: graceful degradation and fail-secure design

MCP server graceful degradation — fail-closed authorization, circuit breakers, and error message hygiene

Failure modes are security decisions. An MCP server that grants access when its authorization service times out is fail-open: the failure itself becomes an attack vector. An MCP server that leaks stack traces to the LLM is handing an attacker a map of its internals. Graceful degradation for MCP servers is not just about uptime; it is about ensuring that every failure state is a secure state. The five patterns below make failure safe by design.

1. Fail-closed authorization — deny on error, never permit

The most dangerous failure mode in an authorization system is permitting access when the authorization check itself fails. This is called fail-open: the error path grants access. In an MCP server, fail-open authorization means a timeout, a network partition, or an exception in your permission-checking code silently promotes an unauthorized caller to an authorized one. The correct design is fail-closed: any non-affirmative result from an authorization check — including exceptions, timeouts, null returns, and unexpected shapes — results in denial. The burden of proof is on confirmation, not on denial.

// ANTI-PATTERN: fail-open authorization
async function checkPermissionUnsafe(sessionId: string, tool: string): Promise<boolean> {
  try {
    return await authService.check(sessionId, tool)
  } catch {
    return true  // ❌ Error = permit — a timeout is now a bypass
  }
}

// FAIL-CLOSED: deny on any non-affirmative result
async function checkPermissionSafe(sessionId: string, tool: string): Promise<boolean> {
  let timer: ReturnType<typeof setTimeout> | undefined
  try {
    const result = await Promise.race([
      authService.check(sessionId, tool),
      new Promise<never>((_, reject) => {
        timer = setTimeout(() => reject(new Error('auth timeout')), 2_000)
      }),
    ])
    // Only return true on explicit boolean true, not on a merely truthy value
    return result === true
  } catch (err) {
    // Log the failure for ops visibility
    logger.warn({ event: 'auth_check_failed', sessionId, tool, err: String(err) })
    return false  // ✅ Any error = deny
  } finally {
    // Clear the timeout so a fast auth response does not leave a stray timer behind
    clearTimeout(timer)
  }
}

// In the tool handler:
server.tool('sensitive_tool', schema, async (args, { sessionId }) => {
  const allowed = await checkPermissionSafe(sessionId, 'sensitive_tool')
  if (!allowed) throw new Error('Access denied')
  // ... proceed
})
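
The "unexpected shapes" case deserves its own guard when the authorization service returns an object rather than a bare boolean. The sketch below assumes a hypothetical response of the form `{ allowed: true }`; the `isExplicitAllow` helper and the response shape are illustrative assumptions, not part of any SDK or of the `authService` used above.

```typescript
// Assumed response shape: { allowed: boolean }. Real services will differ.
// Returns true only when `allowed` is literally the boolean `true`.
// null, strings, numbers, missing fields, and bare booleans are all denials.
function isExplicitAllow(result: unknown): boolean {
  return (
    typeof result === 'object' &&
    result !== null &&
    (result as Record<string, unknown>).allowed === true
  )
}

// Usage: anything that is not an explicit allow is treated as a denial.
const decisions: unknown[] = [{ allowed: true }, { allowed: 'true' }, null, {}]
const outcomes = decisions.map(isExplicitAllow) // [true, false, false, false]
```

The point of the design is that the guard enumerates the single accepted shape instead of trying to list every bad one, so a new failure mode in the upstream service defaults to denial.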

2. Promise.allSettled for multi-source tool aggregation

Many MCP tools aggregate data from multiple upstream sources: a tool might query a database, an API, and a cache to build a composite result. With Promise.all, a single source failure rejects the whole call and the tool returns nothing, even if the other two sources succeeded and returned useful data. Promise.allSettled waits for all promises and returns a result array with status: 'fulfilled' or status: 'rejected' for each. Partial results with explicit failure annotations are more useful to an LLM than a total failure, and they prevent a single flaky dependency from making the entire tool unavailable.

server.tool('get_project_status', { projectId: z.string().uuid() }, async ({ projectId }) => {
  const [dbResult, apiResult, cacheResult] = await Promise.allSettled([
    database.getProject(projectId),
    externalApi.fetchMetrics(projectId),
    redisCache.getRecentActivity(projectId),
  ])

  // Build response from whatever succeeded
  const response: Record<string, unknown> = {}
  const errors: string[] = []

  if (dbResult.status === 'fulfilled') {
    response.project = dbResult.value
  } else {
    errors.push('Core project data temporarily unavailable')
    logger.error({ source: 'database', projectId, err: dbResult.reason })
  }

  if (apiResult.status === 'fulfilled') {
    response.metrics = apiResult.value
  } else {
    errors.push('Metrics temporarily unavailable')
    // Non-critical — log at warn, not error
    logger.warn({ source: 'api', projectId, err: apiResult.reason })
  }

  if (cacheResult.status === 'fulfilled') {
    response.recentActivity = cacheResult.value
  }
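  // A cache failure is silent: recent activity is optional, so it is simply omitted

  // MCP tool results carry their payload in a `content` array
  const payload = { ...response, _warnings: errors.length ? errors : undefined }
  return { content: [{ type: 'text', text: JSON.stringify(payload) }] }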
})
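
When several tools follow the same aggregation pattern, the per-source branching can be pulled into a helper. The following is a minimal sketch under stated assumptions: `aggregateSources` and the `Aggregate` type are hypothetical names introduced for illustration, and internal logging of the rejection reason is left out for brevity.

```typescript
// Outcome of aggregating several named sources.
interface Aggregate {
  data: Record<string, unknown>
  warnings: string[]
}

// Runs every source concurrently and keeps whatever succeeded.
// `sources` maps a response key to a function that loads that part.
async function aggregateSources(
  sources: Record<string, () => Promise<unknown>>,
): Promise<Aggregate> {
  const names = Object.keys(sources)
  // The async wrapper turns a synchronous throw into a rejection,
  // so one misbehaving loader cannot abort the whole aggregation.
  const settled = await Promise.allSettled(
    names.map(async (name) => sources[name]()),
  )

  const data: Record<string, unknown> = {}
  const warnings: string[] = []
  settled.forEach((outcome, i) => {
    if (outcome.status === 'fulfilled') {
      data[names[i]] = outcome.value
    } else {
      // Generic text only; outcome.reason belongs in internal logs.
      warnings.push(`${names[i]} temporarily unavailable`)
    }
  })
  return { data, warnings }
}
```

A handler could then call `aggregateSources({ project: () => database.getProject(id), metrics: () => externalApi.fetchMetrics(id) })` and serialize the result, keeping the warning text free of upstream error details.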

3. Circuit breaker fast-fail for slow or failing dependencies

Without a circuit breaker, a dependency that starts responding slowly causes tool calls to hang for the full timeout duration before failing. If the timeout is 10 seconds and the dependency stays slow, every tool call holds a pending request, a pooled connection, and its memory for 10 seconds. In Node.js these calls do not block threads, but they pile up until connection pools and memory are exhausted and the entire MCP server becomes unresponsive. A circuit breaker counts recent failures and "opens" once a threshold is reached: subsequent calls fast-fail immediately with an error, keeping the server responsive. After a configurable recovery window, the circuit enters the half-open state and lets a test request through; a success closes the circuit again, and a failure reopens it.

class CircuitBreaker {
  private failures = 0
  private lastFailure = 0
  private state: 'closed' | 'open' | 'half-open' = 'closed'

  constructor(
    private readonly threshold = 5,       // failures before opening
    private readonly recoveryMs = 30_000,  // wait 30s before half-open
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'open') {
      if (Date.now() - this.lastFailure > this.recoveryMs) {
        this.state = 'half-open'
      } else {
        throw new Error('Circuit open — dependency temporarily unavailable')
      }
    }

    try {
      const result = await fn()
      // Success: reset
      this.failures = 0
      this.state = 'closed'
      return result
    } catch (err) {
      this.failures++
      this.lastFailure = Date.now()
      if (this.failures >= this.threshold) this.state = 'open'
      throw err
    }
  }
}

const dbBreaker = new CircuitBreaker(5, 30_000)

server.tool('query_records', { filter: z.string() }, async ({ filter }) => {
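  const rows = await dbBreaker.call(() => database.query(filter))
  // MCP tool results carry their payload in a `content` array
  return { content: [{ type: 'text', text: JSON.stringify(rows) }] }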
})
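
The class above lets every concurrent caller through once the recovery window has elapsed. A variant that admits only one probe at a time might look like the sketch below. `ProbingBreaker`, its `getState` accessor, and the injectable `now` clock are assumptions added for illustration and testability; they are not part of the original class.

```typescript
type BreakerState = 'closed' | 'open' | 'half-open'

class ProbingBreaker {
  private failures = 0
  private openedAt = 0
  private probeInFlight = false
  private state: BreakerState = 'closed'
  private readonly threshold: number
  private readonly recoveryMs: number
  private readonly now: () => number

  constructor(threshold = 5, recoveryMs = 30_000, now: () => number = () => Date.now()) {
    this.threshold = threshold   // consecutive failures before opening
    this.recoveryMs = recoveryMs // wait before allowing a probe
    this.now = now               // injectable clock, useful in tests
  }

  getState(): BreakerState {
    return this.state
  }

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'open' && this.now() - this.openedAt >= this.recoveryMs) {
      this.state = 'half-open'
    }
    // Fast-fail while open, and while another caller's probe is pending.
    if (this.state === 'open' || (this.state === 'half-open' && this.probeInFlight)) {
      throw new Error('Circuit open: dependency temporarily unavailable')
    }
    const isProbe = this.state === 'half-open'
    if (isProbe) this.probeInFlight = true

    try {
      const result = await fn()
      this.failures = 0
      this.state = 'closed'
      return result
    } catch (err) {
      this.failures++
      // A failed probe reopens at once; otherwise open at the threshold.
      if (isProbe || this.failures >= this.threshold) {
        this.state = 'open'
        this.openedAt = this.now()
      }
      throw err
    } finally {
      if (isProbe) this.probeInFlight = false
    }
  }
}
```

Limiting the half-open state to one in-flight probe means a still-broken dependency receives a single request per recovery window rather than a burst of everything that queued up while the circuit was open.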

4. Error message hygiene — no stack traces or internals to the LLM

Tool handler errors returned to the MCP caller become visible to the LLM — and through the LLM, to the user. In a prompt-injection scenario, they may also become visible to an attacker who can read the conversation context. A raw Node.js error includes a stack trace with file paths (revealing project structure), line numbers, library names and versions (enabling targeted vulnerability research), and sometimes internal variable values. The rule is simple: sanitize every error before it leaves the tool handler. Log the full error internally. Return only a safe, generic message externally.

// Map internal error types to safe user-facing messages
const ERROR_MESSAGES: Record<string, string> = {
  ECONNREFUSED:  'Service temporarily unavailable',
  ETIMEDOUT:     'Request timed out — please retry',
  ENOTFOUND:     'Service temporarily unavailable',
  UNAUTHORIZED:  'Access denied',
  FORBIDDEN:     'Access denied',
}

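function sanitizeError(err: unknown): string {
  // Log full detail internally first, so known codes and non-Error values are recorded too
  logger.error({
    err: err instanceof Error ? err.message : String(err),
    stack: err instanceof Error ? err.stack : undefined,
  })

  if (err instanceof Error) {
    // Normalize to a string: `status` may be numeric
    const code = String((err as any).code ?? (err as any).status ?? '')
    // Own-property check: a code such as 'constructor' must not match an inherited key
    if (Object.prototype.hasOwnProperty.call(ERROR_MESSAGES, code)) return ERROR_MESSAGES[code]

    // Never return err.message: it may contain file paths, SQL, or credentials
    return 'An internal error occurred'
  }
  return 'An unexpected error occurred'
}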

// Wrap every tool handler
function withErrorSanitization<T>(fn: () => Promise<T>): Promise<T> {
  return fn().catch((err) => {
    throw new Error(sanitizeError(err))
  })
}

server.tool('fetch_data', { id: z.string() }, (args) =>
  withErrorSanitization(() => internalFetch(args.id))
)
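
As a last line of defense, outgoing messages can also be screened for text that looks like internal detail before they are returned. The sketch below is one possible approach, not an exhaustive filter: `LEAK_PATTERNS`, `looksLikeInternalDetail`, and `redactIfLeaky` are hypothetical names, and the patterns are illustrative heuristics that will produce some false positives.

```typescript
// Heuristics for text that suggests internal detail. Illustrative, not exhaustive.
const LEAK_PATTERNS: RegExp[] = [
  /\bat\s+\S+\s+\(.*:\d+:\d+\)/,          // stack frame: "at fn (file.js:10:5)"
  /(?:^|[\s("'])\/(?:[\w.-]+\/)+[\w.-]+/, // absolute POSIX path with 2+ segments
  /[A-Za-z]:\\[\w\\.-]+/,                 // absolute Windows path
  /node_modules/,                         // dependency layout
  /\b(?:SELECT|INSERT|UPDATE|DELETE)\b.+\b(?:FROM|INTO|SET)\b/i, // SQL fragment
]

// True when any heuristic matches the outgoing text.
function looksLikeInternalDetail(text: string): boolean {
  return LEAK_PATTERNS.some((pattern) => pattern.test(text))
}

// Replaces a suspicious message with a generic one; passes safe text through.
function redactIfLeaky(message: string): string {
  return looksLikeInternalDetail(message) ? 'An internal error occurred' : message
}
```

A false positive here only replaces a harmless message with a generic one, so the heuristics err on the side of redaction. Screening of this kind complements sanitizing at the source; it does not replace it.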

5. Read-only fallback mode when write operations fail

When a write dependency — a database write path, a mutating API endpoint, an object store — becomes unavailable, the instinct is to fail the entire tool. But read operations often continue to work: the read replica is still up, the cache is still populated, the read API endpoint still responds. Degrading to read-only preserves significant utility for the LLM: it can still retrieve state, answer questions, and inform decisions — it just cannot make changes. The LLM should be explicitly informed of the degraded state so it can communicate this to users and avoid retrying write operations that will fail.

// Health state: assumed healthy at startup, then updated by the periodic probe below
const serviceHealth = {
  reads:  true,
  writes: true,
}

// Periodically probe write health
setInterval(async () => {
  try {
    await database.writeProbe()  // lightweight health-check write
    serviceHealth.writes = true
  } catch {
    serviceHealth.writes = false
    logger.warn({ event: 'write_path_degraded' })
  }
}, 15_000)

server.tool('update_record', {
  id: z.string().uuid(),
  data: z.record(z.unknown()),
}, async ({ id, data }) => {
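  // Fail fast with a clear, actionable message instead of a generic error
  if (!serviceHealth.writes) {
    const degradedResult = {
      success: false,
      degraded: true,
      message: 'Write operations are temporarily unavailable. Read operations are unaffected. ' +
               'Current record data can still be retrieved with get_record.',
    }
    // isError marks the call as failed so the LLM does not assume the write happened
    return { isError: true, content: [{ type: 'text', text: JSON.stringify(degradedResult) }] }
  }
  const updated = await database.update(id, data)
  return { content: [{ type: 'text', text: JSON.stringify(updated) }] }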
})
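
If several write tools need the same check, the guard can be factored into a wrapper so that no handler forgets it. This is a minimal sketch: `guardWrites`, `ServiceHealth`, and `DegradedResult` are hypothetical names, and how the returned object is serialized into the server's response format is left to the caller.

```typescript
interface ServiceHealth {
  reads: boolean
  writes: boolean
}

interface DegradedResult {
  success: false
  degraded: true
  message: string
}

// Wraps a write handler so it is never invoked while writes are unhealthy.
function guardWrites<A, R>(
  health: ServiceHealth,
  handler: (args: A) => Promise<R>,
): (args: A) => Promise<R | DegradedResult> {
  return async (args: A): Promise<R | DegradedResult> => {
    // Health is read on every call, so recovery takes effect immediately.
    if (!health.writes) {
      const degraded: DegradedResult = {
        success: false,
        degraded: true,
        message:
          'Write operations are temporarily unavailable. ' +
          'Read operations are unaffected.',
      }
      return degraded
    }
    return handler(args)
  }
}
```

Because the wrapper holds a reference to the shared health object rather than a copy, the probe that flips `writes` back to `true` re-enables every guarded tool at once.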

How fail-secure design maps to SkillAudit sub-scores

Graceful degradation failures span SkillAudit's Security, Reliability, and Documentation sub-scores because they are simultaneously security vulnerabilities and operational issues.

For the authorization patterns that underpin fail-closed design, see MCP server access control. For the network-level patterns that interact with circuit breakers, see MCP server network segmentation.

Check your MCP server's failure mode security

SkillAudit detects fail-open authorization patterns, stack trace leakage, and missing circuit breaker configurations in your MCP server implementation.
