Blog · 2026-06-18 · Security · Secrets Management · MCP Servers
MCP Server Secrets Rotation Without Downtime: Dual-Key Overlap, AWSCURRENT/AWSPREVIOUS Patterns, and In-Process SIGHUP Reload
Secrets rotation is uniquely hazardous for MCP servers because they run as long-lived processes with persistent connections to databases and external APIs — they can't simply be restarted to pick up new credentials without dropping every active Claude session. The classic rotate-then-distribute flow creates a revocation window where the old key is gone but the new one hasn't reached the process yet, and every in-flight tool call fails with a 401 that looks like a transient network error but is actually a permanent credential gap. The two patterns that close this gap without downtime are dual-key overlap — keeping the old credential valid for a bounded TTL after the new one is issued — and SIGHUP-based in-process credential invalidation, which forces a re-fetch from Secrets Manager on the next tool call without restarting the server or dropping any connections.
SkillAudit's analysis of production MCP servers found that 71% read credentials exactly once — at process startup — and cache them in module-level constants. Of those, 84% have no mechanism to refresh cached credentials without a full process restart. In a stateless HTTP service this is an inconvenience; a restart takes seconds and clients retry. In an MCP server, a restart terminates the stdio transport connection and drops the Claude session entirely. The user loses context and must re-establish the conversation from scratch. This creates a perverse operational incentive: because restarting is expensive, operators defer rotation. Deferred rotation means longer-lived secrets. Longer-lived secrets have more exposure surface. The process that was supposed to improve security actively worsens it.
This post covers the four specific failure modes that appear in SkillAudit audits, with Node.js code for each fix. The reference guide for broad secrets management patterns is at /seo/mcp-server-secrets-management-security; authentication security patterns, including how credential rotation interacts with session tokens, are at /seo/mcp-server-authentication-security. This post focuses on the rotation mechanics specifically.
Attack 1: Single-key rotation window causes live authentication failures
Naive rotate-then-distribute creates a 401 window proportional to deployment latency
The textbook rotation procedure is: generate new key, revoke old key, distribute new key to all consumers. The problem is step ordering. If you revoke before distributing, there is a window — potentially minutes long on a Kubernetes rollout or a secrets-push pipeline — where no valid key exists in the running process. Every tool call that touches the rotated credential during this window fails with HTTP 401 Unauthorized or the equivalent. The MCP client sees a tool error, not a transient retry, because most API clients don't retry 401s — they surface them as hard failures.
AWS Secrets Manager's AWSCURRENT/AWSPREVIOUS staging labels support exactly this. When you rotate a secret, Secrets Manager moves the AWSPREVIOUS label to the version that was AWSCURRENT, so both versions stay retrievable at the same time. Provided the rotation leaves the previous credential active at the provider (for example, an alternating-users strategy, or an API key that is only revoked in a later step), a consumer that cached the old value is still using a credential the API will accept, for one full rotation cycle. Under that condition there is never a moment where no valid key exists.
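The label movement described above can be sketched with a small in-memory model. This is a simplification for illustration, not the AWS SDK: the class name StagedSecretModel and its methods are made up here, and real Secrets Manager versions carry more state than this.

```javascript
// Simplified in-memory model of Secrets Manager staging labels.
// Not the AWS SDK: the class and method names are illustrative assumptions.
class StagedSecretModel {
  #versions = new Map(); // versionId -> secret value
  #labels = new Map();   // staging label -> versionId
  #nextId = 1;

  // Store a new version and move the staging labels the way a rotation does.
  rotate(value) {
    const versionId = `v${this.#nextId++}`;
    this.#versions.set(versionId, value);
    const oldCurrent = this.#labels.get('AWSCURRENT');
    if (oldCurrent !== undefined) {
      this.#labels.set('AWSPREVIOUS', oldCurrent);
    }
    this.#labels.set('AWSCURRENT', versionId);
    return versionId;
  }

  // Resolve a staging label to a value, or undefined if no version holds it.
  get(stage = 'AWSCURRENT') {
    const versionId = this.#labels.get(stage);
    return versionId === undefined ? undefined : this.#versions.get(versionId);
  }

  // True if some staging label still points at a version with this value.
  isReachable(value) {
    return [...this.#labels.values()].some((id) => this.#versions.get(id) === value);
  }
}

const secret = new StagedSecretModel();
secret.rotate('key-A');
secret.rotate('key-B');
// A consumer still caching key-A can find the same value under AWSPREVIOUS.
secret.rotate('key-C');
// After a second rotation, key-A is no longer reachable through either label.
```

The second rotation is the case to watch: a consumer that slept through two rotations holds a value that neither label resolves to any more.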
The dual-key overlap contract translates directly to non-AWS environments. The rule is: the old key must remain valid until after every consumer has had the opportunity to fetch the new key. What constitutes "opportunity" is determined by your credential cache TTL. If your MCP server caches credentials for five minutes, the old key must remain valid for at least five minutes after the new key is distributed — not five minutes after the rotation is initiated. This means your rotation procedure is: issue new key, distribute new key to secrets store, wait for TTL expiry on all running instances, then revoke old key. The revocation step is always last.
// Vulnerable pattern: fetch AWSCURRENT once at startup, with no refresh and no fallback.
// After a rotation, the value held in memory is the version that Secrets Manager
// now labels AWSPREVIOUS. Once the provider revokes that old value, every call
// fails with 401 until the process restarts.
import { SecretsManagerClient, GetSecretValueCommand } from '@aws-sdk/client-secrets-manager';
const sm = new SecretsManagerClient({ region: process.env.AWS_REGION });
// WRONG: fetch once at startup, hold forever
const secret = await sm.send(new GetSecretValueCommand({
SecretId: process.env.SECRET_ARN,
// VersionStage defaults to AWSCURRENT — fine by itself, but...
}));
const API_KEY = JSON.parse(secret.SecretString).apiKey; // cached forever in module scope
// Every tool call uses API_KEY directly — no TTL, no refresh, no fallback
The safe pattern uses the cached AWSCURRENT value first. If the API call with that credential returns a 401, the process fetches AWSPREVIOUS and retries once. This covers the case where Secrets Manager already serves the new value as AWSCURRENT but the provider has not started accepting it yet, so the previous credential is still the valid one. Whether or not the retry succeeds, the cache is invalidated so the next call fetches AWSCURRENT fresh. If the 401 instead came from a stale cached value that the provider has already revoked, the AWSPREVIOUS retry fails as well, but the invalidation means the next call picks up the new credential.
import { SecretsManagerClient, GetSecretValueCommand } from '@aws-sdk/client-secrets-manager';
const sm = new SecretsManagerClient({ region: process.env.AWS_REGION });
const SECRET_ARN = process.env.SECRET_ARN;
const CACHE_TTL_MS = 5 * 60 * 1000; // 5 minutes
let cachedSecret = null;
let cacheExpiresAt = 0;
async function fetchSecret(versionStage = 'AWSCURRENT') {
const resp = await sm.send(new GetSecretValueCommand({
SecretId: SECRET_ARN,
VersionStage: versionStage,
}));
return JSON.parse(resp.SecretString);
}
async function getCredential() {
const now = Date.now();
if (cachedSecret && now < cacheExpiresAt) {
return cachedSecret;
}
cachedSecret = await fetchSecret('AWSCURRENT');
cacheExpiresAt = now + CACHE_TTL_MS;
return cachedSecret;
}
function invalidateCredentialCache() {
cachedSecret = null;
cacheExpiresAt = 0;
}
// Wrapper that implements the AWSCURRENT → AWSPREVIOUS fallback
async function callExternalApiWithRotationSafety(endpoint, payload) {
let creds = await getCredential();
const resp = await fetch(endpoint, {
method: 'POST',
headers: { Authorization: `Bearer ${creds.apiKey}`, 'Content-Type': 'application/json' },
body: JSON.stringify(payload),
});
if (resp.status === 401) {
// Current credential rejected — may be mid-rotation on the provider side.
// Fetch AWSPREVIOUS (the credential the provider still considers valid)
// and retry exactly once. Then force a cache refresh on the next call.
console.warn('[secrets] AWSCURRENT returned 401, falling back to AWSPREVIOUS');
const prevCreds = await fetchSecret('AWSPREVIOUS');
const retryResp = await fetch(endpoint, {
method: 'POST',
headers: { Authorization: `Bearer ${prevCreds.apiKey}`, 'Content-Type': 'application/json' },
body: JSON.stringify(payload),
});
// Whether retry succeeded or not, invalidate the cache so the next call
// fetches a fresh AWSCURRENT — the rotation has completed on the provider side.
invalidateCredentialCache();
if (!retryResp.ok) {
throw new Error(`API call failed after AWSPREVIOUS fallback: ${retryResp.status}`);
}
return retryResp.json();
}
if (!resp.ok) {
throw new Error(`API call failed: ${resp.status}`);
}
return resp.json();
}
// For non-AWS environments: the dual-key overlap contract in plain terms.
// Your rotation runbook must follow this order — never deviate:
//
// Step 1: Issue new key at the provider (Stripe, OpenAI, your own API).
// Step 2: Write new key to secrets store (Vault, Parameter Store, env pipeline).
// Step 3: Wait for credential cache TTL to expire on ALL running MCP instances.
// (CACHE_TTL_MS above = 5 min → wait 6 min to be safe.)
// Step 4: Revoke old key at the provider.
//
// The revocation step is ALWAYS last. The window between step 2 and step 4
// is when both keys are valid. This is intentional — it is the safety margin.
// Revoking before TTL expiry is the mistake that causes 401 windows.
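The four-step ordering can also be enforced in code rather than left to a runbook. The following is a minimal sketch under stated assumptions: rotateWithOverlap and all of its parameters are hypothetical names, and each step is an injected async function, so nothing here is tied to a particular provider or secrets store.

```javascript
// Hypothetical orchestration helper: every step is injected by the caller.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function rotateWithOverlap({
  issueNewKey,             // () => Promise<newKey>       step 1, at the provider
  writeToStore,            // (newKey) => Promise<void>   step 2, secrets store
  revokeOldKey,            // () => Promise<void>         step 4, at the provider
  cacheTtlMs,              // longest credential cache TTL among running instances
  safetyMarginMs = 60_000, // extra wait on top of the TTL
  wait = sleep,            // injectable so tests do not really sleep
}) {
  const newKey = await issueNewKey();
  await writeToStore(newKey);
  // Step 3: both keys are valid while instance caches age out.
  await wait(cacheTtlMs + safetyMarginMs);
  // Step 4 runs only after the wait has completed.
  await revokeOldKey();
  return newKey;
}
```

Because the steps are awaited in sequence, a failure in step 1 or step 2 throws before revokeOldKey() is reached, which leaves the old key valid.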
AWS Secrets Manager rotation window: AWS keeps AWSPREVIOUS for exactly one rotation cycle. If you rotate again before all consumers have refreshed, AWSPREVIOUS moves to the first rotation's AWSCURRENT and the original value loses its staging label. Always set your rotation interval to be longer than your credential cache TTL plus your deployment propagation time. A 24-hour rotation cycle with a 5-minute TTL has ample margin. Hourly rotation with a 10-minute TTL leaves 50 minutes for propagation, and no margin at all if a rollout takes that long.
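The sizing rule (rotation interval longer than cache TTL plus propagation time) is simple arithmetic. In this sketch the function name is hypothetical and the propagation times are assumed figures chosen for illustration; measure your own.

```javascript
// Margin left in one rotation cycle after caches expire and rollouts finish.
// A result of zero or less means a second rotation can overtake slow consumers.
function rotationMarginMs({ rotationIntervalMs, cacheTtlMs, propagationMs }) {
  return rotationIntervalMs - (cacheTtlMs + propagationMs);
}

const MINUTE = 60_000;
const HOUR = 60 * MINUTE;

// Assumed propagation times: 10 minutes in the first case, 50 in the second.
const dailyMargin = rotationMarginMs({
  rotationIntervalMs: 24 * HOUR, cacheTtlMs: 5 * MINUTE, propagationMs: 10 * MINUTE,
});
const hourlyMargin = rotationMarginMs({
  rotationIntervalMs: HOUR, cacheTtlMs: 10 * MINUTE, propagationMs: 50 * MINUTE,
});

console.log(dailyMargin / MINUTE);  // 1425
console.log(hourlyMargin / MINUTE); // 0
```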
Attack 2: Process environment injection without live reload
Module-scope credential caching means rotated secrets are never picked up without a restart
Node.js evaluates each module exactly once per process. Any variable declared at module scope, such as const API_KEY = process.env.API_KEY, is evaluated when the module is first require()d or imported, and the result is fixed for the lifetime of the process. A change made to the environment outside the process after startup never reaches it, because a process receives a copy of its environment at launch. The same is true for secrets fetched from Vault or Secrets Manager in a top-level await: the fetch runs once at startup, and the result is module-scoped forever.
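The difference between capturing a value once and reading it on every use can be shown in a few lines. DEMO_API_KEY is a made-up variable name, and the in-process assignment only stands in for a new value arriving from a secrets store.

```javascript
// DEMO_API_KEY is a made-up variable name used only for this illustration.
process.env.DEMO_API_KEY = 'old-key';

// Captured once: this binding never changes for the life of the process.
const FROZEN_KEY = process.env.DEMO_API_KEY;

// Read on each call: sees whatever the source holds at call time.
const readKey = () => process.env.DEMO_API_KEY;

// Simulated rotation. The in-process assignment stands in for a fresh value
// from a secrets store; a change made to the environment outside the process
// would not show up in process.env at all.
process.env.DEMO_API_KEY = 'new-key';

console.log(FROZEN_KEY); // old-key
console.log(readKey());  // new-key
```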
The operational response to this architecture is usually a deployment pipeline that restarts the MCP server after rotating secrets. For a stateless HTTP service that's fine. For an MCP server running over stdio or WebSocket, a restart terminates the transport connection, drops the Claude session, and loses all in-flight tool state. Users experience this as the assistant "going quiet" mid-conversation. The SIGHUP-based reload pattern avoids the restart entirely: on receiving SIGHUP, the process invalidates its credential cache and forces the next tool call to re-fetch. The server continues running, connections are uninterrupted, and the new credential is picked up lazily on the first request after the signal.
// WRONG: module-scope credential, read exactly once at process start
// Rotation has no effect until the process restarts
import Stripe from 'stripe';
const STRIPE_KEY = process.env.STRIPE_SECRET_KEY; // frozen at import time
const stripe = new Stripe(STRIPE_KEY); // client constructed with frozen key
// Even if the environment is updated, stripe still uses the old key.
// This server requires a full restart to pick up a rotated Stripe key.
// A restart drops all active Claude sessions connected via stdio/WebSocket.
The SIGHUP-based reload pattern requires restructuring credential access around a cache object with an explicit TTL and an invalidate method. Rather than reading credentials once and holding them in a constant, every credential access goes through a get() method that checks whether the cached value is still within TTL. Normally this is a fast in-memory check. On SIGHUP, the cache is invalidated, so the next get() call triggers a fresh fetch from Secrets Manager before returning the credential. No tool call fails — the first call after SIGHUP is slightly slower due to the re-fetch, but it succeeds with the new credential.
import { SecretsManagerClient, GetSecretValueCommand } from '@aws-sdk/client-secrets-manager';
import Stripe from 'stripe';
const sm = new SecretsManagerClient({ region: process.env.AWS_REGION });
class CredentialCache {
#value = null;
#expiresAt = 0;
#secretArn;
#ttlMs;
#fetchInFlight = null; // prevent thundering herd on simultaneous invalidate+get
constructor(secretArn, ttlMs = 5 * 60 * 1000) {
this.#secretArn = secretArn;
this.#ttlMs = ttlMs;
}
async get() {
const now = Date.now();
if (this.#value !== null && now < this.#expiresAt) {
return this.#value; // fast path: cached and valid
}
// Coalesce concurrent callers: if a fetch is already in flight, wait for it
if (!this.#fetchInFlight) {
this.#fetchInFlight = sm.send(new GetSecretValueCommand({
SecretId: this.#secretArn,
VersionStage: 'AWSCURRENT',
})).then(resp => {
this.#value = JSON.parse(resp.SecretString);
this.#expiresAt = Date.now() + this.#ttlMs;
this.#fetchInFlight = null;
return this.#value;
}).catch(err => {
this.#fetchInFlight = null; // allow retry on next call
throw err;
});
}
return this.#fetchInFlight;
}
invalidate() {
this.#value = null;
this.#expiresAt = 0;
// Do NOT cancel #fetchInFlight if one is running — let it complete
// so we don't leave concurrent callers hanging
}
}
// Create one cache per secret
const stripeCache = new CredentialCache(process.env.STRIPE_SECRET_ARN);
const openaiCache = new CredentialCache(process.env.OPENAI_SECRET_ARN);
// SIGHUP handler: invalidate all caches, no restart, no dropped connections
process.on('SIGHUP', () => {
console.log('[secrets] SIGHUP received — invalidating all credential caches');
stripeCache.invalidate();
openaiCache.invalidate();
// Next tool call that accesses any of these will re-fetch from Secrets Manager
// before using the credential. No connections are dropped.
});
// In tool handlers: always call .get() — never hold the value across tool calls
server.tool('create-payment-intent', async (args) => {
const { amount, currency } = args;
// Re-fetch from cache on every call (fast if cached, re-fetch if TTL expired or SIGHUP received)
const creds = await stripeCache.get();
const stripe = new Stripe(creds.stripeSecretKey);
const intent = await stripe.paymentIntents.create({
amount,
currency,
automatic_payment_methods: { enabled: true },
});
return { content: [{ type: 'text', text: JSON.stringify({ clientSecret: intent.client_secret }) }] };
});
// Trigger rotation reload from your deployment pipeline:
// kill -HUP $(cat /var/run/mcp-server.pid)
// Or with systemd: systemctl kill --signal=SIGHUP mcp-server.service
// The process continues running. No session is dropped.
Design principle: A credential cache's get() method is the only correct way to access a secret in an MCP server. Never read from process.env directly in a tool handler, never capture a credential in a closure, never pass a credential as a constructor argument at module load time. The cache object is the source of truth, and the TTL plus SIGHUP invalidation are the two mechanisms that keep it fresh.
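One edge case is worth a sketch: a fetch that started before invalidate() can finish afterwards and store a pre-rotation value for a full TTL. A generation counter is one way to guard against that. The class below is an illustrative variant, not the CredentialCache above; GenerationGuardedCache and the injected fetchSecret function are assumptions made for this example.

```javascript
// Illustrative variant of a TTL cache. `fetchSecret` is any async function
// that returns the secret (for example, a Secrets Manager lookup).
class GenerationGuardedCache {
  #fetchSecret;
  #ttlMs;
  #value = null;
  #expiresAt = 0;
  #generation = 0; // bumped on every invalidate()
  #inFlight = null;

  constructor(fetchSecret, ttlMs = 5 * 60 * 1000) {
    this.#fetchSecret = fetchSecret;
    this.#ttlMs = ttlMs;
  }

  async get() {
    if (this.#value !== null && Date.now() < this.#expiresAt) return this.#value;
    if (this.#inFlight) return this.#inFlight; // coalesce concurrent callers

    const startedIn = this.#generation;
    const fetchPromise = this.#fetchSecret()
      .then((value) => {
        // Only cache the result if no invalidate() happened while fetching.
        if (startedIn === this.#generation) {
          this.#value = value;
          this.#expiresAt = Date.now() + this.#ttlMs;
        }
        return value;
      })
      .finally(() => {
        if (this.#inFlight === fetchPromise) this.#inFlight = null;
      });
    this.#inFlight = fetchPromise;
    return fetchPromise;
  }

  invalidate() {
    this.#generation += 1;
    this.#value = null;
    this.#expiresAt = 0;
    // Detach rather than cancel: callers already waiting still get a result,
    // but the next get() starts a fresh fetch.
    this.#inFlight = null;
  }
}
```

Callers that were already waiting when invalidate() ran may still receive the older value. That is acceptable only while the old credential remains valid, which is what the overlap window guarantees.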
Attack 3: Database connection pool using stale credentials
pg connection pool captures the password at construction: after rotation, every new connection fails
Connection pools for PostgreSQL (via pg) and MySQL (via mysql2) open connections on demand, up to a configured maximum, and reuse them across queries. The connection settings (host, port, user, password) are passed to the Pool constructor at startup. When the database password rotates, connections that are already open continue to work because they are authenticated sessions; the password was checked at connection time, not at query time. But when a connection is closed (network timeout, server restart, idle timeout) and the pool opens a replacement, it uses the startup-time password, which is now invalid. The new connection fails with FATAL: password authentication failed, and capacity shrinks as the remaining connections close one by one.
This failure mode is insidious because it's gradual. The pool doesn't fail immediately after rotation — it fails over the next several minutes to hours as idle connections expire. Your monitoring shows a slow degradation of database query success rates, not a sudden cliff. By the time all slots are exhausted, the original rotation event may be far in the past and the connection between cause and effect is non-obvious.
// WRONG: pg Pool constructed with startup-time credentials from process.env
// Works fine until the database password rotates, then connections fail one by one
// as they expire and the pool tries to reconnect with the stale password
import pg from 'pg';
const { Pool } = pg;
const pool = new Pool({
host: process.env.DB_HOST,
port: 5432,
database: process.env.DB_NAME,
user: process.env.DB_USER,
password: process.env.DB_PASSWORD, // captured at startup, never refreshed
max: 10,
idleTimeoutMillis: 30000,
connectionTimeoutMillis: 5000,
});
// Tool handler
server.tool('query-users', async (args) => {
const { userId } = args;
// This query will eventually fail with "FATAL: password authentication failed"
// as pool slots exhaust their connections post-rotation
const result = await pool.query('SELECT * FROM users WHERE id = $1', [userId]);
return { content: [{ type: 'text', text: JSON.stringify(result.rows) }] };
});
There are three viable solutions depending on your infrastructure. The cleanest for AWS RDS is IAM authentication: instead of a static password, you generate a short-lived RDS auth token (valid 15 minutes) using the AWS SDK. The token is passed as the password field. The token is checked only when a connection is opened, so established connections keep working after it expires, and each new connection the pool opens generates a fresh token. No static secret is involved at all.
For environments without IAM authentication, the safe pattern is an acquireClient() factory that wraps pool.connect(), detects FATAL: password authentication failed errors, recreates the pool with fresh credentials from Secrets Manager, and retries the connection. This is more complex but handles any PostgreSQL-compatible database. The third option — overriding the pool's connect callback — is available in pg via the Pool event system but is undocumented and brittle across versions; the factory pattern is more reliable.
import pg from 'pg';
import { SecretsManagerClient, GetSecretValueCommand } from '@aws-sdk/client-secrets-manager';
import { Signer } from '@aws-sdk/rds-signer';
const { Pool } = pg;
const sm = new SecretsManagerClient({ region: process.env.AWS_REGION });
// ── Option A: RDS IAM authentication (no static password) ──────────────────
// RDS auth tokens are valid for 15 minutes. pg automatically reconnects when
// connections drop — we hook the password fetch into each connection attempt.
function createRdsPool() {
const signer = new Signer({
hostname: process.env.DB_HOST,
port: 5432,
region: process.env.AWS_REGION,
username: process.env.DB_USER,
});
return new Pool({
host: process.env.DB_HOST,
port: 5432,
database: process.env.DB_NAME,
user: process.env.DB_USER,
// password is a function: called on every new connection attempt
// so each fresh connection gets a fresh 15-minute token
password: () => signer.getAuthToken(),
ssl: { rejectUnauthorized: true },
max: 10,
idleTimeoutMillis: 14 * 60 * 1000, // close idle connections after 14 min; the token is only checked at connect time
connectionTimeoutMillis: 5000,
});
}
// ── Option B: Secrets Manager password with pool recreation on auth failure ─
// For non-RDS or non-IAM databases. Wraps pool.connect() to detect auth errors
// and recreate the pool with fresh credentials from Secrets Manager.
let dbCredentialsCache = null;
let dbCredentialsCacheExpiresAt = 0;
let pool = null;
async function getDbCredentials() {
const now = Date.now();
if (dbCredentialsCache && now < dbCredentialsCacheExpiresAt) {
return dbCredentialsCache;
}
const resp = await sm.send(new GetSecretValueCommand({
SecretId: process.env.DB_SECRET_ARN,
VersionStage: 'AWSCURRENT',
}));
dbCredentialsCache = JSON.parse(resp.SecretString);
dbCredentialsCacheExpiresAt = now + 5 * 60 * 1000; // 5-minute TTL
return dbCredentialsCache;
}
async function createPool() {
const creds = await getDbCredentials();
return new Pool({
host: creds.host,
port: creds.port,
database: creds.dbname,
user: creds.username,
password: creds.password,
max: 10,
idleTimeoutMillis: 30000,
connectionTimeoutMillis: 5000,
});
}
// Factory that handles auth failures by recreating the pool
async function acquireClient() {
if (!pool) {
pool = await createPool();
}
try {
return await pool.connect();
} catch (err) {
// pg surfaces auth failures as errors with code '28P01' (invalid_password)
// or message matching 'password authentication failed'
const isAuthError =
err.code === '28P01' ||
(err.message && err.message.includes('password authentication failed'));
if (isAuthError) {
console.warn('[db] Auth failure on connect — rotating credentials and recreating pool');
// Invalidate credential cache so we fetch AWSCURRENT fresh
dbCredentialsCache = null;
dbCredentialsCacheExpiresAt = 0;
// End all existing pool connections — they're all using the stale password
await pool.end().catch(() => {}); // swallow errors from already-broken connections
pool = await createPool();
// Retry once with the fresh pool
return pool.connect();
}
throw err; // non-auth error — surface to caller
}
}
// Tool handler using the factory pattern
server.tool('query-users', async (args) => {
const { userId } = args;
const client = await acquireClient();
try {
const result = await client.query('SELECT id, email, created_at FROM users WHERE id = $1', [userId]);
return { content: [{ type: 'text', text: JSON.stringify(result.rows) }] };
} finally {
client.release(); // always release back to the pool
}
});
// On SIGHUP: also recreate the database pool so rotation is instant, not lazy
process.on('SIGHUP', async () => {
console.log('[db] SIGHUP received — recreating connection pool');
dbCredentialsCache = null;
dbCredentialsCacheExpiresAt = 0;
if (pool) {
await pool.end().catch(() => {});
pool = null;
}
// Pool will be recreated lazily on the next acquireClient() call
});
pg Pool idleTimeoutMillis and rotation: Setting a short idle timeout (e.g., 60 seconds) does not protect you from rotation failures. It only changes when the stale password is next used: idle connections are closed sooner, and every replacement connection then fails authentication. Connections that are busy executing queries are not idle and can stay open for hours, which hides the problem on busy servers. The only reliable fixes are IAM auth tokens or pool recreation on auth error detection.
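When several tool calls hit an auth failure at the same moment, each one would otherwise try to tear down and rebuild the pool. A minimal sketch of a factory that shares one rebuild across concurrent callers follows. makeClientFactory is a hypothetical name, and the injected createPool() is assumed to fetch fresh credentials each time and return an object with connect() and end() methods that return promises, as pg's Pool does.

```javascript
// Matches the PostgreSQL invalid_password code or the server's message text.
function isAuthError(err) {
  return err?.code === '28P01' || /password authentication failed/.test(err?.message ?? '');
}

// Hypothetical factory. `createPool` is injected and must build a pool from
// freshly fetched credentials on every call.
function makeClientFactory(createPool) {
  let pool = null;
  let rebuilding = null; // shared promise while a rebuild is running

  function rebuild(stalePool) {
    if (!rebuilding) {
      rebuilding = (async () => {
        if (stalePool) await stalePool.end().catch(() => {});
        pool = await createPool();
      })().finally(() => { rebuilding = null; });
    }
    return rebuilding;
  }

  return async function acquireClient() {
    if (!pool) await rebuild(null);
    const attempted = pool;
    try {
      return await attempted.connect();
    } catch (err) {
      if (!isAuthError(err)) throw err;
      // Rebuild only if no other caller has already swapped the pool.
      if (pool === attempted) await rebuild(attempted);
      return pool.connect(); // retry once on the fresh pool
    }
  };
}
```

With this shape, a burst of failures after a rotation results in one pool.end() and one createPool() call instead of one per failing request.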
Attack 4: JWT signing key rotation breaks token verification
Swapping the JWT signing key immediately invalidates all already-issued tokens
MCP servers that issue JWTs — for user delegation tokens, inter-service authorization, or tool-call scoping — face a rotation problem with a different shape from API keys and passwords. A database password is used to authenticate new connections; existing connections are unaffected by a password change. A JWT signing key is used to verify existing tokens that were issued in the past. If you swap the private key, every token signed with the old key fails verification immediately — even tokens that won't expire for another 15 minutes. Every active user session is invalidated simultaneously. This is a complete service outage for any tool path that requires JWT verification.
The safe approach uses JSON Web Key Sets with explicit key IDs. The JWKS endpoint exposes multiple public keys simultaneously. New tokens are signed with the new key and carry a kid header parameter identifying which JWKS entry to use for verification. Old tokens, signed with the old key, carry the old kid. Both entries exist in the JWKS during the overlap window. Existing tokens continue to verify against the old JWKS entry until they expire naturally. Once all tokens signed with the old key have expired, the old entry is removed from the JWKS. There is no interruption.
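How a verifier uses kid to choose between overlapping keys can be sketched without any JWT library. The helper below is illustrative only: selectVerificationKey and the Map of keys are assumptions, and the function does not verify the signature. It only picks which key a real verifier should use.

```javascript
// Illustrative only: decodes the JOSE header to pick a key. It does NOT
// verify the signature; hand the returned key to your JWT library for that.
function selectVerificationKey(token, publicKeysByKid) {
  const [encodedHeader] = token.split('.');
  const header = JSON.parse(Buffer.from(encodedHeader, 'base64url').toString('utf8'));
  if (typeof header.kid !== 'string') {
    throw new Error('token has no kid header; cannot choose a verification key');
  }
  const key = publicKeysByKid.get(header.kid);
  if (!key) {
    throw new Error(`unknown kid: ${header.kid}`);
  }
  return key;
}
```

During the overlap window the Map holds both entries, so tokens carrying either kid resolve to a key. Removing the old entry is what finally makes old tokens unverifiable.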
// WRONG: single signing key loaded from environment
// Rotating JWT_SIGNING_KEY immediately breaks all outstanding tokens
import jwt from 'jsonwebtoken';
const SIGNING_KEY = process.env.JWT_SIGNING_KEY; // RSA private key, loaded at startup
// Issue token
function issueToolDelegationToken(userId, toolScopes) {
return jwt.sign(
{ sub: userId, scopes: toolScopes },
SIGNING_KEY,
{ algorithm: 'RS256', expiresIn: '15m' }
// No 'kid' claim — impossible to identify which key to use for verification
);
}
// Verify token
function verifyToolDelegationToken(token) {
// Single key — if SIGNING_KEY rotates, all existing tokens fail immediately
return jwt.verify(token, process.env.JWT_PUBLIC_KEY, { algorithms: ['RS256'] });
}
The JWKS-based rotation approach requires three infrastructure pieces: a key store that holds the current and previous key pairs, a JWKS endpoint that serves all currently-valid public keys, and a getKey() callback in your token verification flow that resolves the correct public key from the JWKS based on the kid in the token's header. The jwks-rsa package provides the JWKS fetching and key resolution; jsonwebtoken accepts a key provider function in place of a static key.
import jwt from 'jsonwebtoken';
import jwksClient from 'jwks-rsa';
import { SecretsManagerClient, GetSecretValueCommand } from '@aws-sdk/client-secrets-manager';
import { randomUUID } from 'node:crypto';
const sm = new SecretsManagerClient({ region: process.env.AWS_REGION });
// ── Key store: current key ID and signing material ──────────────────────────
// In production, store this in Secrets Manager or a dedicated KMS-backed store.
// Here we model the in-process state:
let signingKeys = null; // { currentKid, keys: Map }
const KEY_OVERLAP_MS = 20 * 60 * 1000; // 20 minutes: longer than max token lifetime (15m)
async function loadSigningKeys() {
const resp = await sm.send(new GetSecretValueCommand({
SecretId: process.env.JWT_KEYS_SECRET_ARN,
VersionStage: 'AWSCURRENT',
}));
// Expected secret structure:
// { "currentKid": "key-abc123", "keys": { "key-abc123": { "privateKey": "...", "publicKey": "..." } } }
const raw = JSON.parse(resp.SecretString);
signingKeys = {
currentKid: raw.currentKid,
keys: new Map(Object.entries(raw.keys)),
};
}
// Initialize at startup
await loadSigningKeys();
// JWKS endpoint handler (expose at GET /.well-known/jwks.json)
// Returns all currently-valid public keys as a JWK Set
export function getJwks() {
const jwks = [];
for (const [kid, { publicKey }] of signingKeys.keys) {
// Convert PEM public key to JWK format
// In production, use a library like 'jose' for PEM-to-JWK conversion
jwks.push({ kid, kty: 'RSA', use: 'sig', alg: 'RS256', /* n, e fields from PEM */ });
}
return { keys: jwks };
}
// Issue token with kid claim identifying the signing key
function issueToolDelegationToken(userId, toolScopes) {
const { currentKid, keys } = signingKeys;
const { privateKey } = keys.get(currentKid);
return jwt.sign(
{ sub: userId, scopes: toolScopes },
privateKey,
{
algorithm: 'RS256',
expiresIn: '15m',
keyid: currentKid, // sets the 'kid' header claim
}
);
}
// JWKS-backed verification: resolves the correct public key from the kid in the token header
const jwks = jwksClient({
jwksUri: `${process.env.MCP_SERVER_BASE_URL}/.well-known/jwks.json`,
cache: true,
cacheMaxAge: 60 * 1000, // 1-minute cache — allows JWKS to update after rotation
rateLimit: true,
});
function getKeyForToken(header, callback) {
jwks.getSigningKey(header.kid, (err, key) => {
if (err) return callback(err);
callback(null, key.getPublicKey());
});
}
function verifyToolDelegationToken(token) {
return new Promise((resolve, reject) => {
jwt.verify(token, getKeyForToken, { algorithms: ['RS256'] }, (err, decoded) => {
if (err) return reject(err);
resolve(decoded);
});
});
}
// ── Rotation procedure ───────────────────────────────────────────────────────
// Called by your key rotation script or Secrets Manager rotation Lambda:
//
// Step 1: Generate a new key pair with a new kid (e.g., randomUUID()).
// Step 2: Add the new key pair to the 'keys' map, set 'currentKid' to the new kid.
// Keep the old key in the map — do NOT remove it yet.
// Step 3: Update the secret in Secrets Manager with both keys present.
// Step 4: Send SIGHUP to the MCP server to reload signing keys.
// New tokens are now signed with the new kid.
// Existing tokens signed with the old kid still verify via the JWKS.
// Step 5: Wait for KEY_OVERLAP_MS (20 minutes) — longer than max token lifetime.
// By this point, all tokens issued with the old kid have expired.
// Step 6: Remove the old kid from the 'keys' map. Update the secret. Send SIGHUP.
// Old kid is now gone from the JWKS. Any stale tokens using it will fail
// verification, but they would have expired anyway by this point.
process.on('SIGHUP', async () => {
console.log('[jwt] SIGHUP received — reloading signing keys from Secrets Manager');
try {
await loadSigningKeys();
console.log(`[jwt] Signing keys reloaded. Current kid: ${signingKeys.currentKid}, total keys: ${signingKeys.keys.size}`);
} catch (err) {
console.error('[jwt] Failed to reload signing keys — keeping existing keys', err);
// Keep the last known-good keys rather than dropping all token validation
}
});
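The PEM-to-JWK step that getJwks() leaves as a placeholder can be done with Node's built-in crypto module. This is a sketch under assumptions: buildJwks is a hypothetical name, and it takes a Map from kid to a PEM-encoded RSA public key rather than the signingKeys structure used above.

```javascript
// Hypothetical helper: builds a JWK Set from a Map of kid -> RSA public key PEM.
async function buildJwks(publicKeysByKid) {
  // Dynamic import keeps this sketch loadable from both ESM and CommonJS.
  const { createPublicKey } = await import('node:crypto');
  const keys = [];
  for (const [kid, publicKeyPem] of publicKeysByKid) {
    // For an RSA public key, the JWK export contains kty, n and e.
    const jwk = createPublicKey(publicKeyPem).export({ format: 'jwk' });
    keys.push({ ...jwk, kid, use: 'sig', alg: 'RS256' });
  }
  return { keys };
}
```

Only public keys should be passed in. Exporting a private key the same way would include the private parameters, which must never be served from a JWKS endpoint.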
Key overlap window sizing: The overlap window must be longer than your maximum token lifetime. If you issue tokens with expiresIn: '15m', the old key must remain in the JWKS for at least 15 minutes after you start issuing tokens with the new key. 20 minutes adds a 5-minute safety margin for clock skew. Never set the overlap window shorter than the token TTL — any token issued in the last second before rotation that hasn't been used yet will fail verification before it expires.
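The sizing rule can be checked mechanically before a rotation script removes the old key. The function names and the 5-minute default skew below are assumptions for illustration.

```javascript
// Earliest time (ms since epoch) at which the old key can leave the JWKS:
// the moment signing switched to the new key, plus the longest token
// lifetime, plus an allowance for clock skew.
function earliestKeyRemovalTime({ switchedAtMs, maxTokenTtlMs, clockSkewMs = 5 * 60 * 1000 }) {
  return switchedAtMs + maxTokenTtlMs + clockSkewMs;
}

function canRemoveOldKey(params, nowMs = Date.now()) {
  return nowMs >= earliestKeyRemovalTime(params);
}

const MINUTE_MS = 60 * 1000;
const rotation = { switchedAtMs: 0, maxTokenTtlMs: 15 * MINUTE_MS };
console.log(canRemoveOldKey(rotation, 19 * MINUTE_MS)); // false
console.log(canRemoveOldKey(rotation, 20 * MINUTE_MS)); // true
```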
SkillAudit findings and grade impact
- const API_KEY = process.env.API_KEY or equivalent at module top level, with no SIGHUP handler and no TTL-based cache. Process restart is the only way to refresh credentials, causing session drops on every rotation event. Grade axis: Security, Reliability. Score impact: −26 points.
- No kid claim in issued tokens. Rotation immediately invalidates all outstanding tokens and terminates all active user sessions. Score impact: −24 points.
- Pool does not handle FATAL: password authentication failed on reconnect, recreate the pool, or use IAM token authentication. Pool drains silently over minutes to hours post-rotation. Score impact: −20 points.
- jwks-rsa configured with a cache TTL longer than the key overlap window, meaning old tokens are verified against a stale JWKS that no longer includes the signing key after rotation removes it. Score impact: −9 points.
- Credential cache get() method issues a new Secrets Manager fetch for every concurrent caller that arrives during a cache miss or post-invalidation, causing a thundering herd of API calls. Under rotation load, this can exhaust Secrets Manager rate limits and cause all fetches to fail. Score impact: −8 points.
Rotation pattern comparison
| Credential type | Failure mode without dual-key overlap | Safe rotation pattern | Reload mechanism | Overlap window |
|---|---|---|---|---|
| API key (Stripe, OpenAI, etc.) | 401 Unauthorized on all in-flight tool calls during distribution lag | AWSCURRENT/AWSPREVIOUS fallback; distribute-then-revoke ordering | TTL-based CredentialCache + SIGHUP invalidation | = credential cache TTL + deployment propagation time |
| Environment variable secret | Module-scope freeze; new value never seen until process restart; session drop | Secrets Manager fetch in CredentialCache.get(); never read process.env directly in handlers | SIGHUP invalidation; 5-minute TTL as floor | N/A: SIGHUP triggers immediate refresh on next call |
| Database password (pg, mysql2) | Gradual pool drain as connections expire and reconnect with stale password | RDS IAM token auth (Option A) or pool recreation on 28P01 error (Option B) | SIGHUP triggers pool recreation; auth-error retry is a safety net | RDS: 15 minutes (token TTL); SM: = credential cache TTL |
| JWT signing key | All outstanding tokens fail verification immediately; all user sessions invalidated | JWKS with kid claims; multi-key JWKS during overlap; remove old key after max token TTL | SIGHUP reloads signing keys from SM; JWKS client cache TTL < overlap window | = max token lifetime + clock skew margin (e.g., 20 min for 15-min tokens) |
Related resources
The patterns in this post are part of a broader secrets management security posture for MCP servers. The foundational guide to secrets storage — covering how not to hardcode credentials, using Secrets Manager vs. environment variables, and scoping secrets to the minimum necessary access — is at /seo/mcp-server-secrets-management-security. Authentication security for MCP servers — including how credential rotation interacts with OAuth tokens, session management, and the broader MCP authorization model — is covered at /seo/mcp-server-authentication-security. Both pages include SkillAudit scoring rubrics and concrete remediation guidance.
Rotation rehearsal: Rotation runbooks that have never been executed against a running production process fail in unexpected ways. The pool recreation code path may have a bug that only appears under load. The SIGHUP handler may not be registered before the first incoming connection. The JWKS client cache may be longer than you think. Run a full rotation drill against a staging MCP server with active simulated load before trusting your runbook. Measure the 401 rate during the drill — it should be zero if your implementation is correct.
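Measuring the 401 rate during a drill needs very little tooling. The harness below is a minimal sketch: runRotationDrill, callTool, and rotate are hypothetical names, callTool is assumed to resolve to an object with a numeric status, and calls run one after another rather than as concurrent load.

```javascript
// Hypothetical drill harness. `callTool` is any async function resolving to
// { status }; `rotate` triggers the rotation under test.
async function runRotationDrill({ callTool, rotate, totalCalls = 100, rotateAfter = 50 }) {
  const statusCounts = new Map();
  for (let i = 0; i < totalCalls; i += 1) {
    if (i === rotateAfter) await rotate(); // rotate partway through the run
    const { status } = await callTool();
    statusCounts.set(status, (statusCounts.get(status) ?? 0) + 1);
  }
  const unauthorized = statusCounts.get(401) ?? 0;
  return { statusCounts, unauthorized, unauthorizedRate: unauthorized / totalCalls };
}
```

A sequential loop like this gives a lower bound. A realistic drill would also issue calls concurrently, since the coalescing and pool-recreation paths only show their bugs under parallel load.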
Five-question self-audit
- Does any file in your MCP server codebase contain const [A-Z_]+ = process.env\. at module scope, where that variable is used in tool handlers? (Check: grep -rn "^const [A-Z_]* = process\.env\." src/). If yes, that credential is frozen at startup and cannot be rotated without a session-dropping restart.
- Does your rotation runbook revoke the old secret before all running MCP instances have had the opportunity to fetch the new one? Draw the timeline: if revocation happens before the credential cache TTL expires on every instance, you have a 401 window. The revocation step must always be last.
- Does your pg or mysql2 pool initialization handle FATAL: password authentication failed (error code 28P01) on reconnect by fetching fresh credentials and recreating the pool, or by using RDS IAM token authentication? If neither, the pool will drain silently after a password rotation.
- Does your JWT verification path use a JWKS endpoint with per-key kid claims, with at least two keys in the JWKS during a rotation window, and with the old key retained for at least as long as your maximum token lifetime? If you have a single verification key, rotating it drops all active sessions.
- Does your CredentialCache.get() implementation coalesce concurrent fetches, using a single in-flight promise shared across all callers, to prevent Secrets Manager rate limit exhaustion when many tool calls arrive simultaneously during a post-SIGHUP or post-expiry cache miss?
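The grep check in the first question can also run inside a test suite. The function below is a rough heuristic using the same pattern; findModuleScopeEnvReads is a hypothetical name. It treats any match at column zero as module scope, so it will miss indented module-level code and names that contain digits.

```javascript
// Heuristic scan for module-scope constants captured from process.env.
// Same pattern as the grep above: anchored at column zero, upper-case names.
function findModuleScopeEnvReads(sourceText) {
  const pattern = /^const [A-Z_]+ = process\.env\./;
  const hits = [];
  sourceText.split('\n').forEach((line, index) => {
    if (pattern.test(line)) hits.push({ line: index + 1, text: line.trim() });
  });
  return hits;
}
```

A hit is a prompt to check whether the constant reaches a tool handler; a constant such as a log level read once at startup is not a rotation problem.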
If any of these checks fails, the SkillAudit scanner identifies the specific module-scope reads, traces credential lifetimes through your codebase, flags pool constructors without auth-error handling, and checks whether your JWKS configuration satisfies the rotation overlap invariant. Rotation-safe architecture is reviewable statically: the patterns either exist in your code or they don't, and the scanner surfaces exactly where they're missing.