MCP server circuit breaker security — fail-open vs fail-closed cascades

Circuit breakers are typically discussed as a reliability pattern — they prevent a failing downstream dependency from taking down the whole service. In MCP servers they're also a security control. When a circuit opens (downstream is failing), the server must decide: fail open (allow requests through without the downstream check) or fail closed (block all requests until the downstream recovers). That decision determines whether a targeted attack on your auth provider, rate limit store, or SSRF blocklist service can be used to disable your security controls entirely.

Why circuit breaker state matters for security

Consider an MCP server that validates API keys against a remote auth service. If the auth service is the only place key validity is checked, an attacker who can cause the auth service to become unreachable (DDoS, DNS poisoning, network partition) controls whether your server accepts all requests or no requests — depending on how the circuit is configured.

A fail-open circuit in front of an auth check turns a denial of service into an authentication bypass: the attacker doesn't need to forge a valid key, only to take down the key validation service. MCP servers that rely on shared cloud auth infrastructure are particularly exposed, because an outage of that infrastructure, whether induced or accidental, opens the circuit on every server that depends on it.
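To make the bypass concrete, here is a minimal sketch that is not taken from any particular MCP server. The names `verifyFailOpen`, `verifyFailClosed`, and the stubbed `authService` are hypothetical; the stub simulates an outage by rejecting every call while `down` is true.

```javascript
// Hypothetical auth client stub: while `down` is true every call rejects,
// simulating an attacker-induced outage of the key validation service.
const authService = {
  down: false,
  validKeys: new Set(['key-123']),
  async verify(key) {
    if (this.down) throw new Error('auth service unreachable');
    return { valid: this.validKeys.has(key) };
  },
};

// Fail-open: an outage is treated as "allow". This is the bypass.
async function verifyFailOpen(key) {
  try {
    return (await authService.verify(key)).valid === true;
  } catch {
    return true; // swallowed error: every key is accepted during the outage
  }
}

// Fail-closed: an outage is treated as "deny".
async function verifyFailClosed(key) {
  try {
    return (await authService.verify(key)).valid === true;
  } catch {
    return false;
  }
}

async function demo() {
  authService.down = true; // the attacker only needs to cause this
  return {
    failOpen: await verifyFailOpen('forged-key'),
    failClosed: await verifyFailClosed('forged-key'),
  };
}

demo().then((r) => console.log(r));
```

During the simulated outage the forged key is accepted by the fail-open variant and rejected by the fail-closed one, even though neither function ever saw a valid credential.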

Fail-closed circuit breaker implementation

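// Circuit breaker with fail-closed for security-critical dependencies
const { createHash } = require('node:crypto');

const STATES = { CLOSED: 'closed', OPEN: 'open', HALF_OPEN: 'half_open' };

class SecurityCircuitBreaker {
  constructor(opts = {}) {
    this.state = STATES.CLOSED;
    this.failureCount = 0;
    this.failureThreshold = opts.failureThreshold ?? 5;
    this.recoveryTimeout = opts.recoveryTimeout ?? 30_000; // 30s before half-open probe
    this.openedAt = null;
    this.failClosed = opts.failClosed ?? true; // SECURITY-CRITICAL SERVICES: default true
    // Last known good responses for fail-closed with stale data, keyed per
    // cache key. A single shared slot would hand one caller's cached result
    // (e.g. { valid: true }) to every other caller while the circuit is open.
    this.lastGoodResponses = new Map();
    this.maxStaleMs = opts.maxStaleMs ?? 300_000; // 5-minute stale window
    this.maxCacheEntries = opts.maxCacheEntries ?? 10_000; // bound memory use
  }

  // cacheKey identifies what is being checked (e.g. a token hash).
  // Omit it to disable the stale-cache fallback for this call.
  async call(fn, cacheKey) {
    if (this.state === STATES.OPEN) {
      const elapsed = Date.now() - this.openedAt;
      if (elapsed < this.recoveryTimeout) {
        return this._handleOpen(cacheKey);
      }
      // Transition to half-open: this call acts as the probe
      this.state = STATES.HALF_OPEN;
    }

    try {
      const result = await fn();
      this._onSuccess(result, cacheKey);
      return result;
    } catch (err) {
      return this._onFailure(err, cacheKey);
    }
  }

  _onSuccess(result, cacheKey) {
    this.failureCount = 0;
    this.state = STATES.CLOSED;
    if (cacheKey !== undefined) {
      this.lastGoodResponses.delete(cacheKey); // re-insert as newest entry
      this.lastGoodResponses.set(cacheKey, { result, cachedAt: Date.now() });
      if (this.lastGoodResponses.size > this.maxCacheEntries) {
        // Map iterates in insertion order, so the first key is the oldest
        this.lastGoodResponses.delete(this.lastGoodResponses.keys().next().value);
      }
    }
  }

  _onFailure(err, cacheKey) {
    this.failureCount++;
    if (this.failureCount >= this.failureThreshold) {
      this.state = STATES.OPEN;
      this.openedAt = Date.now();
      process.stderr.write(JSON.stringify({
        event: 'CIRCUIT_OPENED',
        failClosed: this.failClosed,
        error: err?.message,
        ts: new Date().toISOString(),
      }) + '\n');
    }
    // Every failure takes the open path, even before the threshold trips
    return this._handleOpen(cacheKey);
  }

  _handleOpen(cacheKey) {
    if (!this.failClosed) {
      // Non-security dependency: allow through with degraded behavior
      return { degraded: true };
    }

    // Security-critical: try this key's own stale entry first
    const cached = cacheKey === undefined ? undefined : this.lastGoodResponses.get(cacheKey);
    if (cached && Date.now() - cached.cachedAt < this.maxStaleMs) {
      // Return stale result (e.g., the cached verdict for this token)
      return { ...cached.result, fromStaleCache: true };
    }

    // No usable cache: deny the request
    throw new Error('Security dependency unavailable: request denied');
  }
}

// Usage: auth service gets fail-closed circuit.
// `authService` is assumed to be an existing client whose verify(token)
// resolves to an object of the form { valid: boolean }.
const authCircuit = new SecurityCircuitBreaker({ failClosed: true, maxStaleMs: 300_000 });

async function verifyToken(token) {
  // Key the stale cache by a hash so raw tokens are not kept as map keys
  const cacheKey = createHash('sha256').update(token).digest('hex');
  try {
    const result = await authCircuit.call(() => authService.verify(token), cacheKey);
    // When result.fromStaleCache is set, this is the token's own verdict from
    // the last 5 minutes; a token revoked in that window is still accepted.
    return result.valid === true;
  } catch (err) {
    // Circuit open with no usable cache: reject the request
    return false;
  }
}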

Half-open probe attack surface

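The half-open state is the most dangerous for security-critical circuits. After the recovery timeout, the circuit lets a request through to test whether the downstream has recovered. If that probe is a real user request carrying real credentials or performing a real action, an attacker who can predict the recovery timeout can time their own request to be the probe. Using a synthetic health check as the probe keeps real requests on the fail-closed path until recovery is confirmed.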

// Guard: half-open probe should use a synthetic health-check, not a real request
class SecureCircuitBreaker extends SecurityCircuitBreaker {
  constructor(opts = {}) {
    super(opts);
    if (typeof opts.healthCheck !== 'function') {
      throw new TypeError('opts.healthCheck must be a function returning a promise');
    }
    this.healthCheckFn = opts.healthCheck; // synthetic ping, not a real operation
    this.halfOpenProbeInFlight = false;
  }

  // Any extra arguments are forwarded to the base class unchanged
  async call(fn, ...rest) {
    if (this.state === STATES.OPEN) {
      const elapsed = Date.now() - this.openedAt;
      if (elapsed >= this.recoveryTimeout && !this.halfOpenProbeInFlight) {
        // Run health check in background: don't let a real request be the probe
        this.halfOpenProbeInFlight = true;
        Promise.resolve()
          .then(() => this.healthCheckFn()) // also catches synchronous throws
          .then(() => { this.state = STATES.HALF_OPEN; })
          .catch(() => { this.openedAt = Date.now(); }) // reset timer
          .finally(() => { this.halfOpenProbeInFlight = false; });
      }
      // While probe is in flight or circuit is still open: deny
      return this._handleOpen(...rest);
    }
    // CLOSED or HALF_OPEN: the base class runs fn and records success or failure
    return super.call(fn, ...rest);
  }
}
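The `halfOpenProbeInFlight` flag above only admits a single probe because it is checked and set with no `await` in between. The sketch below isolates that idea so it can be tested on its own. `HalfOpenGate`, `tryAcquireProbe`, and `runProbe` are hypothetical names, and the guarantee assumes the single-threaded execution model of Node.js.

```javascript
// Hypothetical helper: admits at most one probe at a time.
class HalfOpenGate {
  constructor() {
    this.probeInFlight = false;
  }

  // Synchronous check-and-set: there is no await between the test and the
  // assignment, so two callers can never both acquire the probe slot.
  tryAcquireProbe() {
    if (this.probeInFlight) return false;
    this.probeInFlight = true;
    return true;
  }

  release() {
    this.probeInFlight = false;
  }

  // Runs `probe` if the slot is free; otherwise reports that it was skipped.
  async runProbe(probe) {
    if (!this.tryAcquireProbe()) return { ran: false };
    try {
      await probe();
      return { ran: true, healthy: true };
    } catch {
      return { ran: true, healthy: false };
    } finally {
      this.release();
    }
  }
}

// Ten concurrent callers race for the probe slot; only the first one wins.
async function demo() {
  const gate = new HalfOpenGate();
  let probeCalls = 0;
  const slowProbe = () =>
    new Promise((resolve) => { probeCalls++; setTimeout(resolve, 20); });
  const results = await Promise.all(
    Array.from({ length: 10 }, () => gate.runProbe(slowProbe)),
  );
  return { probeCalls, ran: results.filter((r) => r.ran).length };
}

demo().then((r) => console.log(r));
```

Callers that do not win the slot should be sent down the fail-closed path rather than queued, so a slow probe cannot pile up pending requests.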

Cascade failure blast radius

When a circuit opens, every subsequent request takes the fail-closed path. If that path itself does significant work (database lookups, secondary auth calls), a cascade of open circuits can turn into a secondary resource-exhaustion attack. Keep the fail-closed path minimal: a cache lookup and a deny response, not another round-trip.
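As a rough illustration of a minimal fail-closed path, the sketch below assumes recent allow decisions are cached in-process. `DenyFastCache`, `remember`, `decide`, and the size and TTL defaults are hypothetical. The open-circuit path performs one `Map` lookup and otherwise returns a preallocated denial, with no I/O and a bounded memory footprint.

```javascript
// Preallocated, frozen denial: the deny branch allocates nothing per request.
const DENIED = Object.freeze({ allowed: false, reason: 'dependency_unavailable' });

// Hypothetical bounded in-process cache of recent allow decisions.
class DenyFastCache {
  constructor({ maxEntries = 10_000, ttlMs = 300_000 } = {}) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
    this.entries = new Map(); // insertion-ordered, so the first key is the oldest
  }

  // Record that `key` was allowed at time `now` (called while the circuit is closed).
  remember(key, now = Date.now()) {
    this.entries.delete(key); // re-insert so the key moves to the newest position
    this.entries.set(key, now);
    if (this.entries.size > this.maxEntries) {
      // Evict the oldest entry so the cache cannot grow without bound.
      this.entries.delete(this.entries.keys().next().value);
    }
  }

  // Open-circuit path: one lookup, one comparison, no network or disk access.
  decide(key, now = Date.now()) {
    const cachedAt = this.entries.get(key);
    if (cachedAt !== undefined && now - cachedAt < this.ttlMs) {
      return { allowed: true, stale: true };
    }
    return DENIED;
  }
}

const cache = new DenyFastCache({ maxEntries: 2, ttlMs: 1_000 });
cache.remember('caller-a', 0);
console.log(cache.decide('caller-a', 500));   // within TTL: allowed from cache
console.log(cache.decide('caller-a', 2_000)); // expired: denied
console.log(cache.decide('caller-b', 500));   // never seen: denied
```

Bounding the cache matters for the same reason as keeping the path short: an attacker who can open the circuit should not also be able to grow the server's memory with unique keys.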

SkillAudit detection

SkillAudit's Security axis checks for circuit breaker patterns in MCP servers that make external calls: whether the circuit breaker default is fail-open or fail-closed for calls that gate security decisions, and whether the circuit state affects authentication or input validation paths. Servers with try/catch blocks around auth calls that silently swallow errors (effectively fail-open) are flagged under the Authentication Bypass risk category.

Run a SkillAudit scan to identify which external-dependency calls in your MCP server are wired to fail-open and would be security-degraded under a targeted availability attack.

