Concurrency Security · June 2026 · 20 min read
MCP Server Race Conditions Beyond Shared Memory: Distributed Locking, Optimistic Concurrency, and Idempotency Keys for Multi-Instance Deployments
Most MCP server race condition guides stop at in-process mutexes — a Mutex.lock() that serializes concurrent tool calls within a single Node.js process. That's fine for a single instance. The moment you scale to two instances behind a load balancer, every in-process lock becomes useless: the two processes share no memory, so each acquires its own "lock" and both proceed. This post covers the real distributed concurrency stack: Redis-based distributed locks with fencing tokens, Redlock for multi-Redis high availability, optimistic concurrency control with version columns as a lock-free alternative, and the idempotency key pattern that makes tool calls safe to retry without double-executing side effects.
Why in-process locks fail in multi-instance MCP deployments
A typical MCP server serving concurrent agent sessions looks like this in production: two or more Node.js processes behind a load balancer, sharing a single PostgreSQL database. An agent with two parallel sub-agents may route both tool calls to different instances. If both sub-agents call transferBudget(projectId, amount) simultaneously:
Instance A: Receives transferBudget(project-42, 500). Acquires its own Mutex.lock(). Reads current balance: 1000.
Instance B: Receives transferBudget(project-42, 500). Acquires its own Mutex.lock() (different process, different mutex object). Reads current balance: 1000 — the write from A hasn't committed yet.
Instance A: Writes new_balance = 1000 - 500 = 500. Commits. Releases mutex.
Instance B: Writes new_balance = 1000 - 500 = 500 — using the stale read. Commits. Releases mutex. The transfer ran twice but only one deduction was recorded.
The in-process mutex did exactly what it was designed to do: it serialized concurrent calls within each process. But it provided zero coordination across the process boundary. The bug is silent — no exception, no error log, just a balance that is $500 too high.
This is not a theoretical edge case. Agent frameworks that execute multiple tool calls in parallel routinely produce concurrent requests to the same MCP server resource. If your MCP server does any write that is conditioned on a read (read-modify-write), multi-instance deployment makes it a race condition by default.
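The interleaving above can be reproduced in a single file. The following is a minimal sketch, not the article's production code: `Mutex`, `sharedDb`, `makeInstance`, and `simulateLostUpdate` are hypothetical names, the shared object stands in for the PostgreSQL row, and each "instance" gets its own private mutex just as two separate processes would.

```typescript
// Minimal promise-chain mutex standing in for an in-process lock (illustrative only).
class Mutex {
  private tail: Promise<void> = Promise.resolve();

  async runExclusive<T>(fn: () => Promise<T>): Promise<T> {
    const previous = this.tail;
    let release!: () => void;
    this.tail = new Promise<void>(resolve => { release = resolve; });
    await previous; // wait for the previous holder in THIS process only
    try {
      return await fn();
    } finally {
      release();
    }
  }
}

// Stand-in for the shared database row; every access yields to the event loop
// to imitate a network round trip.
const sharedDb = { balance: 1000 };
const tick = () => new Promise<void>(resolve => setTimeout(resolve, 5));
async function readBalance(): Promise<number> { await tick(); return sharedDb.balance; }
async function writeBalance(value: number): Promise<void> { await tick(); sharedDb.balance = value; }

// Each simulated instance owns its own mutex object.
function makeInstance() {
  const mutex = new Mutex();
  return {
    transferBudget: (amount: number) =>
      mutex.runExclusive(async () => {
        const balance = await readBalance();  // read
        await writeBalance(balance - amount); // write based on a possibly stale read
      }),
  };
}

async function simulateLostUpdate(): Promise<number> {
  sharedDb.balance = 1000;
  const instanceA = makeInstance();
  const instanceB = makeInstance();
  await Promise.all([instanceA.transferBudget(500), instanceB.transferBudget(500)]);
  return sharedDb.balance; // 500: one of the two deductions was lost
}
```

Routing both calls through the same instance would serialize them and leave a balance of 0, which is why the bug never shows up in single-instance development.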
Approach 1: Redis distributed lock with fencing tokens
The simplest distributed lock uses Redis's atomic SET key value NX PX milliseconds command. NX means "only set if not exists", which makes acquisition atomic. PX sets the expiry in milliseconds (EX does the same in seconds) so a crashed holder doesn't hold the lock forever. The lock value is a random token that only the holder knows; this ownership token prevents one instance from accidentally releasing another instance's lock. It is not a fencing token in the strict sense, which has to increase with every acquisition; the next section covers that distinction.
import { createClient } from 'redis';
import { randomBytes } from 'crypto';
const redis = createClient({ url: process.env.REDIS_URL });
await redis.connect();
async function acquireLock(
resource: string,
ttlMs: number
): Promise<string | null> {
const token = randomBytes(16).toString('hex');
const key = `lock:${resource}`;
// SET key token NX PX ttlMs — atomic: set only if not exists, expire after ttlMs
const result = await redis.set(key, token, { NX: true, PX: ttlMs });
return result === 'OK' ? token : null;
}
async function releaseLock(resource: string, token: string): Promise<boolean> {
const key = `lock:${resource}`;
// Lua script: compare-and-delete atomically — prevents releasing another holder's lock
const script = `
if redis.call("GET", KEYS[1]) == ARGV[1] then
return redis.call("DEL", KEYS[1])
else
return 0
end
`;
const result = await redis.eval(script, { keys: [key], arguments: [token] });
return result === 1;
}
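// Usage in a tool handler (assumes `db` is a node-postgres Pool)
async function transferBudget(projectId: string, amount: number) {
  const lockResource = `budget:${projectId}`;
  const lockTtl = 5000; // 5 seconds
  const token = await acquireLock(lockResource, lockTtl);
  if (!token) {
    throw new Error('Resource is locked by another operation; retry after 1s');
  }
  // FOR UPDATE only holds its row lock inside a transaction, so both statements
  // run on one dedicated connection between BEGIN and COMMIT.
  const client = await db.connect();
  try {
    await client.query('BEGIN');
    const balance = await client.query(
      'SELECT balance FROM projects WHERE id = $1 FOR UPDATE', [projectId]
    );
    if (balance.rows[0].balance < amount) throw new Error('Insufficient balance');
    await client.query(
      'UPDATE projects SET balance = balance - $1 WHERE id = $2',
      [amount, projectId]
    );
    await client.query('COMMIT');
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  } finally {
    client.release();
    await releaseLock(lockResource, token);
  }
}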
The TTL is a safety net, not the primary release path. Always release the lock in a finally block. The TTL prevents deadlocks if the process crashes while holding the lock — but it must be longer than your critical section could ever take, or a slow database operation will cause the lock to expire mid-section, allowing another instance to acquire it while you're still holding what you think is an exclusive lock.
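One way to notice that a critical section has outrun its TTL is to track when the lock was acquired and check the remaining validity just before the write. This is a sketch under assumptions: `HeldLock`, `remainingValidityMs`, `assertLockStillValid`, and the 500 ms safety margin are all hypothetical, and the check only narrows the window (a pause can still happen after it passes).

```typescript
interface HeldLock {
  token: string;
  acquiredAtMs: number; // timestamp taken just BEFORE sending SET ... NX PX
  ttlMs: number;
}

// Milliseconds of validity left, measured on the local clock.
function remainingValidityMs(lock: HeldLock, nowMs: number = Date.now()): number {
  return lock.acquiredAtMs + lock.ttlMs - nowMs;
}

// Throws when too little validity remains to safely perform the write.
function assertLockStillValid(
  lock: HeldLock,
  safetyMarginMs = 500,
  nowMs: number = Date.now()
): void {
  const remaining = remainingValidityMs(lock, nowMs);
  if (remaining < safetyMarginMs) {
    throw new Error(`Lock validity too low (${remaining}ms left); aborting before write`);
  }
}
```

Calling `assertLockStillValid(lock)` immediately before the UPDATE turns a silent overlap into an explicit, retryable error in the common slow-query case.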
The fencing token problem
The random token in the lock value solves a subtle bug: without it, a process whose lock has already expired (for example after a long stall) could issue a DEL lock:budget:42 that deletes the lock acquired by a different instance after the expiry. The Lua compare-and-delete makes release atomic: it checks that the key's value matches the token before deleting, so only the original acquirer can release it.
The random token only protects the release path. For storage operations (like database writes), you need a true fencing token: a number that increases with every lock acquisition and is passed as a write condition to the storage layer, which rejects any write carrying a token older than the newest one it has seen. This defends against a specific failure mode, lock expiry during a process pause: a garbage-collection pause or a stalled event loop freezes the process for longer than the remaining TTL, the lock expires, another instance acquires it, and when the first process resumes it believes it still holds the lock and proceeds to write. For most MCP server workloads, running SELECT ... FOR UPDATE and the UPDATE inside a single database transaction within the Redis lock critical section gives comparable protection for writes to that database: the row lock makes the read-modify-write atomic even if the Redis lock has expired.
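The storage-side check can be illustrated with an in-memory model. This is a sketch, not a real storage API: `LockService` and `FencedStore` are hypothetical, and the counter stands in for whatever issues increasing tokens in practice (a Redis INCR on a counter key is one option).

```typescript
// Issues a strictly increasing token with every lock acquisition.
class LockService {
  private counter = 0;
  acquire(): number {
    return ++this.counter;
  }
}

// Storage layer that remembers the highest token it has accepted
// and rejects writes from holders with an older token.
class FencedStore {
  private highestToken = 0;
  private value = 0;

  write(fencingToken: number, newValue: number): boolean {
    if (fencingToken < this.highestToken) {
      return false; // stale holder: its lock was superseded while it was paused
    }
    this.highestToken = fencingToken;
    this.value = newValue;
    return true;
  }

  read(): number {
    return this.value;
  }
}

// Scenario: A acquires, pauses past its TTL, B acquires and writes, A resumes.
function simulatePausedHolder(): { aAccepted: boolean; bAccepted: boolean; value: number } {
  const locks = new LockService();
  const store = new FencedStore();
  const tokenA = locks.acquire(); // 1
  const tokenB = locks.acquire(); // 2, issued after A's lock expired
  const bAccepted = store.write(tokenB, 700);
  const aAccepted = store.write(tokenA, 500); // A wakes up and writes late
  return { aAccepted, bAccepted, value: store.read() };
}
```

The random hex token from `acquireLock` cannot play this role because it carries no ordering; the storage layer has no way to tell which of two random strings is newer.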
Approach 2: Redlock for multi-Redis high availability
A single Redis instance is a single point of failure for all your locks. If it goes down, all lock acquisitions fail until it recovers — or, depending on your Redis configuration, there may be a window where the Redis master has accepted a lock that wasn't replicated to replicas before a failover, meaning two instances think they both hold the same lock. Redlock addresses this by acquiring the lock on N independent Redis instances and requiring a quorum (⌊N/2⌋+1) to consider the lock held.
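The quorum rule can be written down directly. The sketch below also subtracts acquisition time and a clock-drift allowance from the TTL, as Redlock implementations commonly do; the 1% plus 2 ms allowance and the names `quorum`, `lockValidityMs`, and `isLockHeld` are assumptions for illustration, not the API of any particular library.

```typescript
// Minimum number of instances that must grant the lock: floor(N/2) + 1.
function quorum(instanceCount: number): number {
  return Math.floor(instanceCount / 2) + 1;
}

// Time the lock can still be trusted after acquisition finished.
// driftFactor is an assumed allowance for clock drift between nodes.
function lockValidityMs(ttlMs: number, elapsedMs: number, driftFactor = 0.01): number {
  const driftMs = Math.round(driftFactor * ttlMs) + 2;
  return ttlMs - elapsedMs - driftMs;
}

// The lock counts as held only with a majority AND positive remaining validity.
function isLockHeld(
  grants: number,
  instanceCount: number,
  ttlMs: number,
  elapsedMs: number
): boolean {
  return grants >= quorum(instanceCount) && lockValidityMs(ttlMs, elapsedMs) > 0;
}
```

With three instances the quorum is two, so one node can be down without blocking acquisitions; with five it is three.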
import Redlock from 'redlock';
import Redis from 'ioredis';
// node-redlock (v5) is written against ioredis clients, which connect on construction
const redisInstances = [
  new Redis(process.env.REDIS_URL_1!),
  new Redis(process.env.REDIS_URL_2!),
  new Redis(process.env.REDIS_URL_3!),
];
const redlock = new Redlock(redisInstances, {
  // Retry 3 times, 200ms apart, with up to 100ms jitter to avoid convoy effect
  retryCount: 3,
  retryDelay: 200,
  retryJitter: 100,
  // Only used by redlock.using(), which extends the lock when it is within 500ms
  // of expiry; a lock taken with acquire() is not extended automatically
  automaticExtensionThreshold: 500,
});
async function transferBudgetRedlock(projectId: string, amount: number) {
  // Acquire the lock on a quorum of the 3 Redis instances with a 5s TTL
  const lock = await redlock.acquire([`lock:budget:${projectId}`], 5000);
  // Assumes `db` is a node-postgres Pool; FOR UPDATE needs a transaction
  const client = await db.connect();
  try {
    await client.query('BEGIN');
    const result = await client.query(
      'SELECT balance FROM projects WHERE id = $1 FOR UPDATE',
      [projectId]
    );
    const balance = result.rows[0].balance;
    if (balance < amount) throw new Error('Insufficient balance');
    await client.query(
      'UPDATE projects SET balance = balance - $1 WHERE id = $2',
      [amount, projectId]
    );
    await client.query('COMMIT');
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  } finally {
    client.release();
    await lock.release();
  }
}
Redlock is a good choice when you already operate Redis (for example a Cluster or Sentinel deployment for your caching layer) and can run additional standalone nodes for the lock quorum; the quorum members must be independent masters, not replicas of one another. For teams that don't operate Redis at this scale, optimistic concurrency (below) is often simpler to reason about.
Approach 3: Optimistic concurrency control with version columns
Pessimistic locking (Redis distributed lock) serializes all writers — if you have 100 concurrent tool calls competing for the same record, 99 of them are blocked waiting for the lock. Optimistic concurrency control (OCC) takes the opposite bet: allow all readers to proceed concurrently, but detect conflicts at write time using a version column. If the record was modified between your read and your write, the write is rejected and you retry. This is lock-free and scales much better under low-contention workloads.
// Schema: ALTER TABLE projects ADD COLUMN version INTEGER NOT NULL DEFAULT 0;
async function transferBudgetOCC(
projectId: string,
amount: number,
maxRetries = 5
): Promise<void> {
for (let attempt = 0; attempt < maxRetries; attempt++) {
// Read with version
const result = await db.query(
'SELECT balance, version FROM projects WHERE id = $1',
[projectId]
);
const { balance, version } = result.rows[0];
if (balance < amount) throw new Error('Insufficient balance');
// Update only if version hasn't changed since our read
const updateResult = await db.query(
`UPDATE projects
SET balance = balance - $1, version = version + 1
WHERE id = $2 AND version = $3`,
[amount, projectId, version]
);
if (updateResult.rowCount === 1) {
return; // Success — we were the only writer during this window
}
// Conflict detected — another writer incremented the version; retry with backoff
const backoffMs = Math.min(50 * Math.pow(2, attempt), 1000);
await new Promise(resolve => setTimeout(resolve, backoffMs + Math.random() * 50));
}
throw new Error(`transferBudget: conflict persisted after ${maxRetries} retries`);
}
OCC is the right default for most MCP tool handlers. It requires no external infrastructure (just a version column in your schema), is trivially testable, and scales horizontally without a locking bottleneck. Reserve Redis distributed locks for operations where retry is expensive (like sending an email or charging a card) — and combine OCC with an idempotency key (below) for those cases.
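The version check can be exercised without a database. In this sketch, `VersionedRow` is a hypothetical in-memory stand-in whose `updateIfVersion` mimics the row count of UPDATE ... WHERE version = $3; backoff is omitted to keep it short.

```typescript
interface Row { balance: number; version: number }

const tick = () => new Promise<void>(resolve => setTimeout(resolve, 1));

class VersionedRow {
  private row: Row = { balance: 1000, version: 0 };

  async select(): Promise<Row> {
    await tick();
    return { ...this.row };
  }

  // Returns 1 if the write was applied, 0 if the version had moved on.
  async updateIfVersion(expectedVersion: number, newBalance: number): Promise<number> {
    await tick();
    if (this.row.version !== expectedVersion) return 0;
    this.row = { balance: newBalance, version: expectedVersion + 1 };
    return 1;
  }
}

// Read, compute, conditional write; retry from the read on conflict.
async function transferOCC(row: VersionedRow, amount: number, maxRetries = 5): Promise<number> {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    const { balance, version } = await row.select();
    if (balance < amount) throw new Error('Insufficient balance');
    const rowCount = await row.updateIfVersion(version, balance - amount);
    if (rowCount === 1) return attempt; // number of attempts it took
  }
  throw new Error(`conflict persisted after ${maxRetries} retries`);
}

async function runTwoTransfers(): Promise<{ attempts: number[]; final: Row }> {
  const row = new VersionedRow();
  const attempts = await Promise.all([transferOCC(row, 300), transferOCC(row, 300)]);
  return { attempts, final: await row.select() };
}
```

Both transfers read version 0, so one conditional write is rejected and that caller retries against the fresh row. The final state is a balance of 400 at version 2, with neither deduction lost.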
Choosing between pessimistic and optimistic concurrency
| Criterion | Pessimistic (Redis lock) | Optimistic (version column) |
|---|---|---|
| Contention level | High — many concurrent writers on same record | Low to medium — conflicts are rare |
| Infrastructure required | Redis (or Redlock multi-Redis) | Version column in DB — no external infra |
| Retry cost | Low — lock holder does the work, others wait | Medium — each conflict requires a full retry |
| Side-effect safety | Lock serializes side effects, but a retried call still needs an idempotency key | Must combine with idempotency keys for side effects |
| Deadlock risk | Yes — across multiple resources; mitigated by TTL | No — no locks held across transactions |
| Scalability | Bottleneck at lock granularity | Scales horizontally with low contention |
Approach 4: The idempotency key pattern for exactly-once tool execution
Distributed locks and OCC prevent race conditions in database writes. But many MCP tool calls have side effects beyond the database: they send emails, charge credit cards, invoke third-party APIs, or enqueue jobs. These operations are often non-idempotent — calling them twice produces two emails, two charges, two jobs. The idempotency key pattern makes them safe to retry.
An idempotency key is a caller-supplied token (UUID) that the server uses to deduplicate requests. If the server receives two requests with the same idempotency key, it returns the result of the first request without executing the operation again.
// Schema:
// CREATE TABLE idempotency_keys (
// key TEXT PRIMARY KEY,
// response_body JSONB NOT NULL,
// created_at TIMESTAMPTZ DEFAULT NOW()
// );
// CREATE INDEX ON idempotency_keys (created_at); -- for TTL cleanup
async function sendInvoice(
  customerId: string,
  amount: number,
  idempotencyKey: string,
  attempt = 0 // internal poll counter while another call holds the key
): Promise<{ invoiceId: string; sent: boolean }> {
  // Check if we've seen this key before
  const existing = await db.query(
    'SELECT response_body FROM idempotency_keys WHERE key = $1',
    [idempotencyKey]
  );
  let heldByAnotherCall = false;
  if (existing.rows.length > 0) {
    const body = existing.rows[0].response_body;
    // A finished row holds the cached response: do NOT re-send the invoice
    if (!body.pending) return body as { invoiceId: string; sent: boolean };
    // A { pending: true } placeholder means the first call is still running
    heldByAnotherCall = true;
  } else {
    // Reserve the idempotency key before executing the side effect.
    // ON CONFLICT DO NOTHING handles the race where two instances insert simultaneously
    const reservation = await db.query(
      `INSERT INTO idempotency_keys (key, response_body)
       VALUES ($1, $2::jsonb)
       ON CONFLICT (key) DO NOTHING
       RETURNING key`,
      [idempotencyKey, JSON.stringify({ pending: true })]
    );
    heldByAnotherCall = reservation.rowCount === 0;
  }
  if (heldByAnotherCall) {
    // Wait for the owner to finish, but give up after a bounded number of polls
    if (attempt >= 10) {
      throw new Error('Request with this idempotency key is still in progress; retry later');
    }
    await new Promise(r => setTimeout(r, 200));
    return sendInvoice(customerId, amount, idempotencyKey, attempt + 1);
  }
  // We own the key: execute the side effect
  let response: { invoiceId: string; sent: boolean };
  try {
    const invoiceId = await invoiceService.create(customerId, amount);
    await emailService.send(customerId, invoiceId);
    response = { invoiceId, sent: true };
  } catch (err) {
    // On failure, delete the reservation so the caller can retry
    await db.query('DELETE FROM idempotency_keys WHERE key = $1', [idempotencyKey]);
    throw err;
  }
  // Persist the response: future calls with the same key return it immediately
  await db.query(
    'UPDATE idempotency_keys SET response_body = $1::jsonb WHERE key = $2',
    [JSON.stringify(response), idempotencyKey]
  );
  return response;
}
The reservation step is critical. The naive approach — check-then-execute-then-insert — has a TOCTOU race where two instances both see no existing key and both execute the side effect. The pattern above reserves the key (with ON CONFLICT DO NOTHING) before executing the side effect. Only the instance that wins the INSERT gets to proceed; the loser sees rowCount 0 and waits.
Idempotency key lifecycle
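Idempotency keys need a TTL to prevent the table from growing without bound. A 24-hour or 7-day expiry is typical: long enough that a retry from a crashed agent will find the cached response, short enough to keep the table manageable. A scheduled job that runs DELETE FROM idempotency_keys WHERE created_at < NOW() - INTERVAL '7 days' handles cleanup, and the index on created_at keeps that delete cheap. A partial index cannot do this job: PostgreSQL index predicates may not call NOW(), and an index never removes rows.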
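A scheduled cleanup can be as small as one parameterized DELETE. The sketch below takes the query function as an argument so it can be tested without a database; `Query` and `purgeExpiredIdempotencyKeys` are hypothetical names, and the result shape is an assumption modeled on typical PostgreSQL clients.

```typescript
// Assumed shape of a query function; adapt to your database client.
type Query = (sql: string, params: unknown[]) => Promise<{ rowCount: number }>;

// Deletes reservations and cached responses older than the retention window.
// Returns the number of rows removed.
async function purgeExpiredIdempotencyKeys(
  query: Query,
  retentionDays = 7
): Promise<number> {
  const result = await query(
    `DELETE FROM idempotency_keys
     WHERE created_at < NOW() - ($1::int * INTERVAL '1 day')`,
    [retentionDays]
  );
  return result.rowCount;
}
```

Running this from a cron job or a periodic timer once an hour is usually enough; the retention window, not the job frequency, determines how long retries stay deduplicated.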
The key must come from the caller, not be generated by the server. When an agent framework retries a failed tool call, it must use the same key it sent in the original attempt. If the server generates the key, a retry will generate a new key and the deduplication is lost.
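On the caller side, the essential detail is that the key is generated once, outside the retry loop. A minimal sketch, assuming a hypothetical `ToolCall` function type that forwards the key to the MCP server:

```typescript
import { randomUUID } from 'node:crypto';

// Hypothetical shape of a tool invocation that accepts an idempotency key.
type ToolCall<T> = (idempotencyKey: string) => Promise<T>;

// Retries a tool call, reusing the SAME key on every attempt so the server
// can recognize the retries as duplicates of the first request.
async function callWithIdempotentRetry<T>(
  call: ToolCall<T>,
  maxAttempts = 3
): Promise<T> {
  const idempotencyKey = randomUUID(); // generated once per logical operation
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await call(idempotencyKey);
    } catch (err) {
      lastError = err; // e.g. a timeout: the server may or may not have executed
    }
  }
  throw lastError;
}
```

Moving the `randomUUID()` call inside the loop would reproduce exactly the failure described above: every retry would look like a brand-new request.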
Combining the patterns: the full multi-instance tool call stack
Most production MCP server tool handlers that mutate state need a combination of these patterns, not one in isolation:
Read-modify-write on database records
Use OCC (version column) as the default. For high-contention records, use Redis distributed lock + SELECT ... FOR UPDATE inside the critical section.
External API calls with side effects
Use idempotency keys passed by the caller. Reserve the key in a dedupe table before executing the side effect, then store the response alongside it once the call succeeds. Delete the key on failure so retries are permitted.
Job enqueueing (no duplicate jobs)
Use database-level uniqueness on job deduplication key. INSERT INTO jobs ... ON CONFLICT (dedup_key) DO NOTHING. Same pattern as idempotency keys.
In-process mutex only
Never sufficient for multi-instance deployments. Fine for single-instance development, but breaks the moment you add a second instance or a load balancer.
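For the job-enqueueing case, the unique constraint does the deduplication and the row count reports whether this call was the one that enqueued. A sketch under assumptions: the `payload` column, the `Query` type, and `enqueueJobOnce` are hypothetical; only `jobs` and `dedup_key` come from the text above.

```typescript
// Assumed shape of a query function; adapt to your database client.
type Query = (sql: string, params: unknown[]) => Promise<{ rowCount: number }>;

// Returns true if this call inserted the job, false if an identical
// dedup_key was already present (the job is already queued).
async function enqueueJobOnce(
  query: Query,
  dedupKey: string,
  payload: object
): Promise<boolean> {
  const result = await query(
    `INSERT INTO jobs (dedup_key, payload)
     VALUES ($1, $2::jsonb)
     ON CONFLICT (dedup_key) DO NOTHING`,
    [dedupKey, JSON.stringify(payload)]
  );
  return result.rowCount === 1;
}
```

This relies on a UNIQUE constraint or unique index on `dedup_key`; without one, ON CONFLICT has nothing to conflict with.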
Testing distributed concurrency in MCP tool handlers
Race conditions are notoriously hard to reproduce in tests because they depend on timing. Two techniques help:
Interleaved execution tests: split your tool handler into the read and write phases and inject a pause between them to simulate preemption. Run two concurrent executions and verify that no update is lost: each call either succeeds (possibly after a retry) or returns a conflict error, and the final balance reflects exactly the calls that succeeded:
test('OCC: concurrent transfers conflict once, both succeed via retry', async () => {
  await db.query("INSERT INTO projects VALUES ('p1', 1000, 0)");
  // Patch db.query to pause for 50ms after the first SELECT, simulating the race window
  let readCount = 0;
  const origQuery = db.query.bind(db);
  jest.spyOn(db, 'query').mockImplementation(async (sql, params) => {
    const result = await origQuery(sql, params);
    if (typeof sql === 'string' && sql.includes('SELECT balance') && ++readCount === 1) {
      await new Promise(r => setTimeout(r, 50)); // let instance B read and write before A writes
    }
    return result;
  });
  const [r1, r2] = await Promise.allSettled([
    transferBudgetOCC('p1', 300),
    transferBudgetOCC('p1', 300),
  ]);
  // The paused call hits a version conflict, retries against the fresh row, and succeeds
  const successes = [r1, r2].filter(r => r.status === 'fulfilled').length;
  expect(successes).toBe(2);
  const balance = await db.query('SELECT balance FROM projects WHERE id = $1', ['p1']);
  // Both $300 deductions are applied: 1000 - 300 - 300 = 400. A lost update would leave 700.
  expect(Number(balance.rows[0].balance)).toBe(400);
});
Chaos injection — a more realistic approach is to instrument your tool handlers with configurable delay injection (controlled by an environment variable) and run a load test that fires 50 concurrent requests at the same resource. Assert that the final database state reflects exactly the expected number of mutations. Any deviation is a race condition.
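A delay injector and a concurrent driver are enough to make this kind of check repeatable. In the sketch below, `CHAOS_MAX_DELAY_MS` is a hypothetical environment variable, and the deliberately racy counter exists only to show what a detected deviation looks like.

```typescript
// Hypothetical environment variable; 0 (the default) disables injection.
const CHAOS_MAX_DELAY_MS = Number(process.env.CHAOS_MAX_DELAY_MS ?? 0);

// Sleeps for a random duration up to maxMs; a no-op when maxMs is 0.
async function chaosDelay(maxMs: number = CHAOS_MAX_DELAY_MS): Promise<void> {
  if (maxMs <= 0) return;
  await new Promise<void>(resolve => setTimeout(resolve, Math.random() * maxMs));
}

// Starts n handler calls at once and reports how many fulfilled.
async function fireConcurrently(
  n: number,
  handler: (index: number) => Promise<void>
): Promise<number> {
  const results = await Promise.allSettled(
    Array.from({ length: n }, (_, index) => handler(index))
  );
  return results.filter(r => r.status === 'fulfilled').length;
}

// Deliberately unguarded read-modify-write, used to show a detected race.
function makeRacyCounter(delayMs: number) {
  const state = { value: 0 };
  const increment = async (): Promise<void> => {
    const seen = state.value;    // read
    await chaosDelay(delayMs);   // injected gap between read and write
    state.value = seen + 1;      // write based on the stale read
  };
  return { state, increment };
}

async function detectRace(
  n: number,
  delayMs: number
): Promise<{ succeeded: number; finalValue: number }> {
  const { state, increment } = makeRacyCounter(delayMs);
  const succeeded = await fireConcurrently(n, increment);
  return { succeeded, finalValue: state.value };
}
```

With 50 concurrent calls, all 50 report success while the counter ends at 1, so the assertion that the final value equals the number of successful mutations fails, which is the signal to look for in a real handler.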
SkillAudit findings for concurrency issues in MCP servers
- DEL lock:key without checking the token value can delete another instance's lock if the TTL expired during a slow critical section
- UPDATE missing WHERE version = $N: the version check is never enforced; concurrent writers silently overwrite each other
Deployment checklist for MCP servers with shared state
- Every read-modify-write pattern uses either OCC (version column + WHERE version = N) or a Redis distributed lock with compare-and-delete release
- Redis lock TTL is at least 3× the P99 duration of the critical section — monitor P99 and adjust if the critical section slows under load
- External API calls with side effects accept an idempotency key from the caller, reserve the key before executing the side effect, and persist the response once it completes
- Idempotency keys are reserved (INSERT ON CONFLICT DO NOTHING) before the side effect, not after — the reservation prevents the TOCTOU race
- Failed operations delete their idempotency key reservation so callers can safely retry
- Idempotency key table has a TTL-based cleanup job (scheduled DELETE WHERE created_at < NOW() - INTERVAL '7 days')
- Concurrency tests inject delays in the read-write gap to reliably reproduce races — run them in CI
- Redlock used instead of single-Redis lock if Redis availability SLA is critical to lock correctness
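The TTL rule in the checklist can be turned into a small calculation over measured critical-section durations. This sketch uses the nearest-rank percentile method; `percentile` and `recommendedLockTtlMs` are hypothetical helpers, and the 3× multiplier is the checklist's rule of thumb rather than a guarantee.

```typescript
// Nearest-rank percentile of a list of duration samples (milliseconds).
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(0, rank - 1)];
}

// Lock TTL derived from observed critical-section latency.
function recommendedLockTtlMs(samplesMs: number[], multiplier = 3): number {
  return Math.ceil(percentile(samplesMs, 99) * multiplier);
}
```

Feeding this from production latency metrics and alerting when the configured TTL falls below the recommendation catches the case where the critical section slows down under load.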
Summary
The root cause of distributed race conditions in MCP servers is not missing locks; it is using the wrong scope of lock. In-process mutexes are correct for serializing concurrent async tasks within a single process. They are completely ineffective for coordinating writes across multiple process instances. The fixes are well-understood: Redis SET NX PX with compare-and-delete release for pessimistic locking; version columns with WHERE version = N for optimistic concurrency; and idempotency keys with a reservation table for exactly-once side effects. Each pattern has a specific use case (OCC for low-contention read-modify-write, Redis lock for high-contention or multi-resource operations, idempotency keys for non-idempotent external calls), and the most resilient MCP servers use all three in the right places.
The failure mode without these patterns is silent data corruption: wrong balances, duplicate emails, double charges, bugs that don't appear in single-instance development environments and are detected only in production under concurrent agent load. SkillAudit checks for unguarded read-modify-write patterns, in-process-only mutex usage in multi-instance configurations, and external API calls inside retry loops without idempotency key handling as part of every audit.